From 758ec64e03f71998789c498f03424eee349353fa Mon Sep 17 00:00:00 2001 From: David Galiffi Date: Wed, 15 May 2024 18:32:59 -0400 Subject: [PATCH] Starting to fix linting errors in markdown files. --- .gitlab/issue_templates/example.md | 10 +- .gitlab/merge_request_templates/example.md | 12 +- .markdownlint.yaml | 10 + AI/MIGraphX/Quantization/README.md | 26 +- ...Running-Quantized-ResNet50-via-MIGraphX.md | 157 +++++------ Applications/README.md | 13 +- Applications/bitonic_sort/README.md | 11 + Applications/convolution/README.md | 17 +- Applications/floyd_warshall/README.md | 14 +- Applications/histogram/README.md | 6 +- Applications/monte_carlo_pi/README.md | 34 ++- Applications/prefix_sum/README.md | 41 ++- Dockerfiles/README.md | 15 +- Docs/CONTRIBUTING.md | 54 +++- HIP-Basic/README.md | 13 +- HIP-Basic/assembly_to_executable/README.md | 36 ++- HIP-Basic/bandwidth/README.md | 9 +- HIP-Basic/bit_extract/README.md | 9 +- HIP-Basic/cooperative_groups/README.md | 16 +- HIP-Basic/device_globals/README.md | 8 + HIP-Basic/device_query/README.md | 20 +- HIP-Basic/dynamic_shared/README.md | 10 + HIP-Basic/events/README.md | 10 +- HIP-Basic/gpu_arch/README.md | 8 + HIP-Basic/hello_world/README.md | 30 ++- HIP-Basic/hipify/README.md | 8 +- HIP-Basic/inline_assembly/README.md | 14 +- HIP-Basic/llvm_ir_to_executable/README.md | 29 ++- HIP-Basic/matrix_multiplication/README.md | 17 +- HIP-Basic/module_api/README.md | 43 +-- HIP-Basic/moving_average/README.md | 7 + HIP-Basic/multi_gpu_data_transfer/README.md | 6 + HIP-Basic/occupancy/README.md | 8 +- HIP-Basic/opengl_interop/README.md | 48 +++- HIP-Basic/runtime_compilation/README.md | 43 +-- HIP-Basic/saxpy/README.md | 21 +- HIP-Basic/shared_memory/README.md | 16 +- HIP-Basic/static_device_library/README.md | 15 ++ HIP-Basic/static_host_library/README.md | 20 ++ HIP-Basic/streams/README.md | 5 + HIP-Basic/texture_management/README.md | 9 +- HIP-Basic/vulkan_interop/README.md | 112 ++++---- HIP-Basic/warp_shuffle/README.md | 5 + LICENSE.md | 2 +- ...-Sanitizer-with-a-Short-HIP-Application.md | 19 +- Libraries/hipBLAS/README.md | 14 +- .../hipBLAS/gemm_strided_batched/README.md | 71 ++--- Libraries/hipBLAS/her/README.md | 29 ++- Libraries/hipBLAS/scal/README.md | 19 +- Libraries/hipCUB/README.md | 25 +- Libraries/hipCUB/device_radix_sort/README.md | 9 +- Libraries/hipCUB/device_sum/README.md | 8 +- Libraries/hipSOLVER/README.md | 14 +- Libraries/hipSOLVER/gels/README.md | 23 +- Libraries/hipSOLVER/geqrf/README.md | 138 ++++++---- Libraries/hipSOLVER/gesvd/README.md | 113 ++++---- Libraries/hipSOLVER/getrf/README.md | 21 +- Libraries/hipSOLVER/potrf/README.md | 24 +- Libraries/hipSOLVER/syevd/README.md | 89 ++++--- Libraries/hipSOLVER/syevdx/README.md | 142 +++++----- Libraries/hipSOLVER/syevj/README.md | 16 +- Libraries/hipSOLVER/syevj_batched/README.md | 47 +++- Libraries/hipSOLVER/sygvd/README.md | 137 ++++++---- Libraries/hipSOLVER/sygvj/README.md | 103 ++++---- Libraries/rocBLAS/README.md | 14 +- Libraries/rocBLAS/level_1/axpy/README.md | 13 +- Libraries/rocBLAS/level_1/dot/README.md | 13 +- Libraries/rocBLAS/level_1/nrm2/README.md | 24 +- Libraries/rocBLAS/level_1/scal/README.md | 22 +- Libraries/rocBLAS/level_1/swap/README.md | 13 +- Libraries/rocBLAS/level_2/gemv/README.md | 17 +- Libraries/rocBLAS/level_2/her/README.md | 13 +- Libraries/rocBLAS/level_3/gemm/README.md | 60 +++-- .../level_3/gemm_strided_batched/README.md | 76 +++--- README.md | 246 ++++++++++-------- 75 files changed, 1719 insertions(+), 870 deletions(-) create mode 100644 
.markdownlint.yaml diff --git a/.gitlab/issue_templates/example.md b/.gitlab/issue_templates/example.md index 8d32b8fa..826019b9 100644 --- a/.gitlab/issue_templates/example.md +++ b/.gitlab/issue_templates/example.md @@ -1,12 +1,12 @@ # Example checklist - Elaboration - - [ ] Example concept is described and agreed upon + - [ ] Example concept is described and agreed upon - Implementation - - [ ] Example is implemented + - [ ] Example is implemented - Internal review - - [ ] Internal code review is done + - [ ] Internal code review is done - External review - - [ ] Upstreaming PR is opened, external review is done + - [ ] Upstreaming PR is opened, external review is done - Done - - [ ] Example merged to upstream + - [ ] Example merged to upstream diff --git a/.gitlab/merge_request_templates/example.md b/.gitlab/merge_request_templates/example.md index 1221d302..09132083 100644 --- a/.gitlab/merge_request_templates/example.md +++ b/.gitlab/merge_request_templates/example.md @@ -1,16 +1,18 @@ ## Notes for the reviewer + _The reviewer should acknowledge all these topics._ ## Checklist before merge + - [ ] CMake support is added - - [ ] Dependencies are copied via `IMPORTED_RUNTIME_ARTIFACTS` if applicable + - [ ] Dependencies are copied via `IMPORTED_RUNTIME_ARTIFACTS` if applicable - [ ] GNU Make support is added (Linux) - [ ] Visual Studio project is added for VS2017, 2019, 2022 (Windows) (use [the script](https://projects.streamhpc.com/departments/knowledge/employee-handbook/-/wikis/Projects/AMD/Libraries/examples/Adding-Visual-Studio-Projects-to-new-examples#scripts)) - - [ ] DLL dependencies are copied via `` sets `length` as the number of elements of the array that will be sorted. It must be a power of $2$. Its default value is $2^{15}$. - `-s ` sets `sort` as the type or sorting that we want our array to have: decreasing ("dec") or increasing ("inc"). The default value is "inc". ## Key APIs and Concepts + - Device memory is allocated with `hipMalloc` and deallocated with `hipFree`. + - With `hipMemcpy` data bytes can be transferred from host to device (using `hipMemcpyHostToDevice`) or from device to host (using `hipMemcpyDeviceToHost`). + - `hipEventCreate` creates events, which are used in this example to measure the kernels execution time. `hipEventRecord` starts recording an event, `hipEventSynchronize` waits for all the previous work in the stream when the specified event was recorded. With these three functions it can be measured the start and stop times of the kernel and with `hipEventElapsedTime` it can be obtained the kernel execution time in milliseconds. Lastly, `hipEventDestroy` destroys an event. + - `myKernelName<<<...>>>` queues kernel execution on the device. All the kernels are launched on the `hipStreamDefault`, meaning that these executions are performed in order. `hipGetLastError` returns the last error produced by any runtime API call, allowing to check if any kernel launch resulted in error. 
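The allocation, copy, timing and launch pattern described in the bullets above can be sketched as follows. This is a minimal, illustrative sketch rather than this example's actual code: the kernel, the launch geometry and the `HIP_CHECK` error-checking macro (assumed to come from the common example utilities) are placeholders.

```cpp
#include <hip/hip_runtime.h>

// Illustrative kernel: doubles each element in place.
__global__ void example_kernel(float* data, unsigned int length)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < length)
    {
        data[i] *= 2.f;
    }
}

void timed_run(float* h_data, unsigned int length)
{
    // Allocate device memory and copy the input to the device.
    float* d_data{};
    HIP_CHECK(hipMalloc(&d_data, length * sizeof(float)));
    HIP_CHECK(hipMemcpy(d_data, h_data, length * sizeof(float), hipMemcpyHostToDevice));

    // Create events and record them around the kernel launch on the default stream.
    hipEvent_t start, stop;
    HIP_CHECK(hipEventCreate(&start));
    HIP_CHECK(hipEventCreate(&stop));
    HIP_CHECK(hipEventRecord(start, hipStreamDefault));
    example_kernel<<<dim3((length + 255) / 256), dim3(256), 0, hipStreamDefault>>>(d_data, length);
    HIP_CHECK(hipGetLastError()); // check whether the kernel launch succeeded
    HIP_CHECK(hipEventRecord(stop, hipStreamDefault));
    HIP_CHECK(hipEventSynchronize(stop)); // wait until 'stop' has been recorded

    // Elapsed time between the two events, in milliseconds.
    float elapsed_ms{};
    HIP_CHECK(hipEventElapsedTime(&elapsed_ms, start, stop));

    // Copy the result back and release all resources.
    HIP_CHECK(hipMemcpy(h_data, d_data, length * sizeof(float), hipMemcpyDeviceToHost));
    HIP_CHECK(hipEventDestroy(start));
    HIP_CHECK(hipEventDestroy(stop));
    HIP_CHECK(hipFree(d_data));
}
```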
## Demonstrated API Calls ### HIP runtime + #### Device symbols + - `blockDim` - `blockIdx` - `threadIdx` #### Host symbols + - `__global__` - `hipEvent_t` - `hipEventCreate` diff --git a/Applications/convolution/README.md b/Applications/convolution/README.md index f0786ce6..5099d23a 100644 --- a/Applications/convolution/README.md +++ b/Applications/convolution/README.md @@ -1,11 +1,13 @@ # Applications Convolution Example ## Description + This example showcases a simple GPU implementation for calculating the [discrete convolution](https://en.wikipedia.org/wiki/Convolution#Discrete_convolution). The key point of this implementation is that in the GPU kernel each thread calculates the value for a convolution for a given element in the resulting grid. For storing the mask constant memory is used. Constant memory is a read-only memory that is limited in size, but offers faster access times than regular memory. Furthermore on some architectures it has a separate cache. Therefore accessing constant memory can reduce the pressure on the memory system. ### Application flow + 1. Default values for the size of the grid, mask and the number of iterations for the algorithm execution are set. 2. Command line arguments are parsed. 3. Host memory is allocated for the input, output and the mask. Input data is initialized with random numbers between 0-256. @@ -17,7 +19,9 @@ For storing the mask constant memory is used. Constant memory is a read-only mem 9. In case requested the convoluted grid, the input grid, and the reference results are printed to standard output. ### Command line interface + There are three parameters available: + - `-h` displays information about the available parameters and their default values. - `-x width` sets the grid size in the x direction. Default value is 4096. - `-y height` sets the grid size in the y direction. Default value is 4096. @@ -25,22 +29,31 @@ There are three parameters available: - `-i iterations` sets the number of times that the algorithm will be applied to the (same) grid. It must be an integer greater than 0. Its default value is 10. ## Key APIs and Concepts -- For this GPU implementation of the simple convolution calculation, the main kernel (`convolution`) is launched in a 2-dimensional grid. Each thread computes the convolution for one element of the resulting grid. + +- For this GPU implementation of the simple convolution calculation, the main kernel (`convolution`) is launched in a 2-dimensional grid. Each thread computes the convolution for one element of the resulting grid. + - Device memory is allocated with `hipMalloc` which is later freed by `hipFree`. -- Constant memory is declared in global scope for the mask, using the `__constant__` qualifier. The size of the object stored in constant memory must be available at compile time. Later the memory is initialized with `hipMemcpyToSymbol`. + +- Constant memory is declared in global scope for the mask, using the `__constant__` qualifier. The size of the object stored in constant memory must be available at compile time. Later the memory is initialized with `hipMemcpyToSymbol`. + - With `hipMemcpy` data can be transferred from host to device (using `hipMemcpyHostToDevice`) or from device to host (using `hipMemcpyDeviceToHost`). + - `myKernelName<<<...>>>` queues the kernel execution on the device. All the kernels are launched on the default stream `hipStreamDefault`, meaning that these executions are performed in order. 
`hipGetLastError` returns the last error produced by any runtime API call, making it possible to check whether any kernel launch resulted in an error. +
- `hipEventCreate` creates the events used to measure kernel execution time, `hipEventRecord` starts recording an event and `hipEventSynchronize` waits for all the previous work in the stream when the specified event was recorded. These three functions can be used to measure the start and stop times of the kernel, and with `hipEventElapsedTime` the kernel execution time (in milliseconds) can be obtained. With `hipEventDestroy` the created events are freed.
## Demonstrated API Calls
### HIP runtime +
#### Device symbols +
- `blockIdx`
- `blockDim`
- `threadIdx`
#### Host symbols +
- `__global__`
- `__constant__`
- `hipEventCreate`
diff --git a/Applications/floyd_warshall/README.md b/Applications/floyd_warshall/README.md index 60e595ae..d567121c 100644 --- a/Applications/floyd_warshall/README.md +++ b/Applications/floyd_warshall/README.md @@ -1,6 +1,7 @@ # Applications Floyd-Warshall Example
## Description +
This example showcases a GPU implementation of the [Floyd-Warshall algorithm](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm), which computes the shortest path between each pair of nodes in a given directed and (in this case) complete graph $G = (V, E, \omega)$. The key point of this implementation is that each kernel launch represents a step $k$ of the traditional CPU-implemented algorithm. Therefore, the kernel is launched as many times as the graph has nodes $\left(n = \vert V \vert \right)$. In this example, there are `iterations` (consecutive) executions of the algorithm on the same graph. As each execution requires an unmodified graph input, multiple copy operations are required. Hence, the performance of the example can be improved by using _pinned memory_. @@ -10,6 +11,7 @@ Pinned memory is simply a special kind of memory that cannot be paged out the ph
Therefore, using pinned memory saves significant time needed to copy from/to host memory. In this example, performance is improved by using this type of memory, given that there are `iterations` (consecutive) executions of the algorithm on the same graph.
### Application flow +
1. Default values for the number of nodes of the graph and the number of iterations for the algorithm execution are set.
2. Command line arguments are parsed (if any) and the previous values are updated.
3. A number of constants are defined for kernel execution and input/output data size. @@ -20,30 +22,40 @@ Therefore, using pinned memory saves significant time needed to copy from/to hos
8. The mean time in milliseconds needed for each iteration is printed to standard output.
9. The results obtained are compared with the CPU implementation of the algorithm. The result of the comparison is printed to the standard output.
- ### Command line interface +
There are three parameters available: +
- `-h` displays information about the available parameters and their default values.
- `-n nodes` sets `nodes` as the number of nodes of the graph to which the Floyd-Warshall algorithm will be applied. It must be a (positive) multiple of `block_size` (= 16). Its default value is 16.
- `-i iterations` sets `iterations` as the number of times that the algorithm will be applied to the (same) graph. It must be an integer greater than 0. Its default value is 1.
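The structure described in the Description above — one kernel launch per step $k$, with the input graph kept in pinned host memory — can be sketched as follows. This is a hedged sketch, not the code of `main.hip`: the kernel arguments, helper names and the `HIP_CHECK` macro are assumptions.

```cpp
#include <hip/hip_runtime.h>

// One launch updates all pairwise shortest paths for step k, i.e. it considers
// paths that use node v_k as an intermediate node.
__global__ void floyd_warshall_kernel(unsigned int* dist, unsigned int n, unsigned int k)
{
    const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x; // column
    const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y; // row
    const unsigned int path = dist[y * n + k] + dist[k * n + x];  // path through v_k
    if(path < dist[y * n + x])
    {
        dist[y * n + x] = path;
    }
}

void run_algorithm(unsigned int* d_dist, unsigned int n)
{
    constexpr unsigned int block_size = 16; // matches the multiple-of-16 requirement above
    const dim3 grid(n / block_size, n / block_size);
    const dim3 block(block_size, block_size);
    // The kernel is launched n times, once per step of the algorithm.
    for(unsigned int k = 0; k < n; ++k)
    {
        floyd_warshall_kernel<<<grid, block, 0, hipStreamDefault>>>(d_dist, n, k);
        HIP_CHECK(hipGetLastError());
    }
}

unsigned int* allocate_pinned_input(unsigned int n)
{
    unsigned int* h_dist{};
    // Pinned host memory, mapped into the device address space, so that the
    // unmodified input graph can be copied cheaply before each iteration.
    HIP_CHECK(hipHostMalloc(&h_dist, n * n * sizeof(unsigned int), hipHostMallocMapped));
    return h_dist; // must later be released with hipHostFree
}
```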
## Key APIs and Concepts + - For this GPU implementation of the Floyd-Warshall algorithm, the main kernel (`floyd_warshall_kernel`) that is launched in a 2-dimensional grid. Each thread in the grid computes the shortest path between two nodes of the graph at a certain step $k$ $\left(0 \leq k < n \right)$. The threads compare the previously computed shortest paths using only the nodes in $V'=\{v_0,v_1,...,v_{k-1}\} \subseteq V$ as intermediate nodes with the paths that include node $v_k$ as an intermediate node, and take the shortest option. Therefore, the kernel is launched $n$ times. + - For improved performance, pinned memory is used to pass the results obtained in each iteration to the next one. With `hipHostMalloc` pinned host memory (accessible by the device) can be allocated, and `hipHostFree` frees it. In this example, host pinned memory is allocated using the `hipHostMallocMapped` flag, which indicates that `hipHostMalloc` must map the allocation into the address space of the current device. Beware that an excessive allocation of pinned memory can slow down the host execution, as the program is left with less physical memory available to map the rest of the virtual addresses used. + - Device memory is allocated using `hipMalloc` which is later freed using `hipFree` + - With `hipMemcpy` data bytes can be transferred from host to device (using `hipMemcpyHostToDevice`) or from device to host (using `hipMemcpyDeviceToHost`), among others. + - `myKernelName<<<...>>>` queues the kernel execution on the device. All the kernels are launched on the `hipStreamDefault`, meaning that these executions are performed in order. `hipGetLastError` returns the last error produced by any runtime API call, allowing to check if any kernel launch resulted in error. + - `hipEventCreate` creates the events used to measure kernel execution time, `hipEventRecord` starts recording an event and `hipEventSynchronize` waits for all the previous work in the stream when the specified event was recorded. With these three functions it can be measured the start and stop times of the kernel, and with `hipEventElapsedTime` the kernel execution time (in milliseconds) can be obtained. ## Demonstrated API Calls ### HIP runtime + #### Device symbols + - `blockIdx` - `blockDim` - `threadIdx` #### Host symbols + - `__global__` - `hipEventCreate` - `hipEventDestroy` diff --git a/Applications/histogram/README.md b/Applications/histogram/README.md index edfb59b5..54216bd8 100644 --- a/Applications/histogram/README.md +++ b/Applications/histogram/README.md @@ -1,6 +1,7 @@ # Applications: Histogram Example ## Description + This program showcases a GPU kernel and its invocation of a histogram computation over a byte (`unsigned char`) array. A histogram constructs a table with the counts of each discrete value. The diagram below showcases a 4 bin histogram over an 8-element long array: @@ -14,8 +15,8 @@ This is solved by striding over the input such a way that each thread accesses a ![A diagram illustrating bank conflicts and solution using striding.](bank_conflict_reduction.svg) - ### Application flow + 1. Define and allocate inputs and outputs on host. 2. Allocate the memory on device and copy the input. 3. Launch the histogram kernel. @@ -24,6 +25,7 @@ This is solved by striding over the input such a way that each thread accesses a 6. Verify the results on host. ### Key APIs and concepts + - _Bank conflicts._ Memory is stored across multiple banks. Elements in banks are stored in 4-byte words. 
Each thread within a wavefront should access different banks to ensure high throughput.
- `__ffs(int input)` finds the 1-index of the first set least significant bit of the input.
- `__syncthreads()` halts this thread until all threads within the same block have reached this point. @@ -34,6 +36,7 @@ This is solved by striding over the input such a way that each thread accesses a
### HIP runtime
#### Device symbols +
- `blockDim`
- `blockIdx`
- `threadIdx` @@ -42,6 +45,7 @@ This is solved by striding over the input such a way that each thread accesses a
- `__shared__`
#### Host symbols +
- `__global__`
- `hipEvent_t`
- `hipEventCreate`
diff --git a/Applications/monte_carlo_pi/README.md b/Applications/monte_carlo_pi/README.md index 77ade824..0d5ceb8d 100644 --- a/Applications/monte_carlo_pi/README.md +++ b/Applications/monte_carlo_pi/README.md @@ -1,6 +1,7 @@ # Applications Monte Carlo Pi Example
## Description +
This example demonstrates how the mathematical constant pi ($\pi$) can be approximated using Monte Carlo integration. Monte Carlo integration approximates integration of a function by generating random values over a domain that is the superset of the function's domain. Using the ratio between the number of samples in both domains and the range of the random values, the integral is approximated.
The area of a disk is given by $r^2\pi$, where $r$ is the radius of the disk. Uniform random values are typically generated in the range $(0,1]$. Using a disk of radius $1$ centered on the origin, a sample point is in the disk if its distance to the origin is less than $1$. The ratio between the number of sample points within the disk and the total sample points is an approximation of the ratio between the area of the disk and the quadrant $(0,1]\times(0,1]$, which is $\frac{\pi}{4}$. Multiplying the sample point ratio by $4$ approximates the value of pi. @@ -10,43 +11,58 @@ To generate a large number of random samples we use hipRAND, a platform-independ
To compute the number of sample points that lie within the disk, we use hipCUB, which is a platform-independent library providing GPU primitives. For each sample, we are looking to compute whether it lies in the disk, and to count the number of samples for which this is the case. Using an indicator function and `TransformInputIterator`, an iterator is created which outputs a zero or one for each sample. Using `DeviceReduce::Sum`, the sum over the iterator's values is computed.
### Application flow +
1. Parse and validate user input.
2. Allocate device memory to store the random values. Since the samples are two-dimensional, two random values are required per sample.
3. Initialize hipRAND's default pseudorandom-number generator and generate the required number of values.
4. Allocate and initialize the input and output for hipCUB's `DeviceReduce::Sum`: - 1. Create a `hipcub::CountingInputIterator` that starts from `0`, which will represent the sample index. - 2. Create a `hipcub::TransformInputIterator` that uses the sample index to obtain the sample's coordinates from the +
+ a) Create a `hipcub::CountingInputIterator` that starts from `0`, which will represent the sample index.
+
+ b) Create a `hipcub::TransformInputIterator` that uses the sample index to obtain the sample's coordinates from the array of random numbers, and computes whether it lies within the disk. This iterator will be the input for the device function. - 3. Allocate device memory for the variable that stores the output of the function.
+ + c) Allocate device memory for the variable that stores the output of the function. + 5. Calculate the required amount of temporary storage, and allocate it. 6. Calculate the number of samples within the disk with `hipcub::DeviceReduce::Sum`. 7. Copy the result back to the host and calculate pi. 8. Clean up the generator and print the result. -9. Initialize hipRAND's default quasirandom-number generator, set the dimensions to two, and generate the required - number of values. Note that the first half of the array will be the first dimension, the second half will be the - second dimension. + +9. Initialize hipRAND's default quasirandom-number generator, set the dimensions to two, and generate the required number of values. + + Note that the first half of the array will be the first dimension, the second half will be the second dimension. + 10. Repeat steps 4. - 8. for the quasirandom values. ### Command line interface + - `-s ` or `-sample_count ` sets the number of samples used, the default is $2^{20}$. ## Key APIs and Concepts -- To start using hipRAND, a call to `hiprandCreateGenerator` with a generator type is made. - - To pick any of hipRAND's pseudorandom-number generators, we use type `HIPRAND_RNG_PSEUDO_DEFAULT`. For pseudorandom-number generators, the seed can be set with `hiprandSetPseudoRandomGeneratorSeed`. + +- To start using hipRAND, a call to `hiprandCreateGenerator` with a generator type is made. + + - To pick any of hipRAND's pseudorandom-number generators, we use type `HIPRAND_RNG_PSEUDO_DEFAULT`. For pseudorandom-number generators, the seed can be set with `hiprandSetPseudoRandomGeneratorSeed`. - We use type `HIPRAND_RNG_QUASI_DEFAULT` to create a quasirandom-number generator. For quasirandom-number generators, the number of dimensions can be set with `hiprandSetQuasiRandomGeneratorDimensions`. For this example, we calculate an area, so our domain consists of two dimensions. Destroying the hipRAND generator is done with `hiprandDestroyGenerator`. + - hipCUB itself requires no initialization, but each of its functions must be called twice. The first call must have a null-valued temporary storage argument, the call sets the required storage size. The second call performs the actual operation with the user-allocated memory. + - hipCUB offers a number of iterators for convenience: + - `hipcub::CountingInputIterator` will act as an incrementing sequence starting from a specified index. - `hipcub::TransformInputIterator` takes an iterator and applies a user-defined function on it. + - hipCUB's `DeviceReduce::Sum` computes the sum over the input iterator and outputs a single value to the output iterator. ## Demonstrated API Calls ### HIP runtime + - `__device__` - `__forceinline__` - `__host__` @@ -64,6 +80,7 @@ To compute the number of sample points that lie within the disk, we use hipCUB, - `hipStreamDefault` ### hipRAND + - `HIPRAND_RNG_PSEUDO_DEFAULT` - `HIPRAND_RNG_QUASI_DEFAULT` - `HIPRAND_STATUS_SUCCESS` @@ -76,6 +93,7 @@ To compute the number of sample points that lie within the disk, we use hipCUB, - `hiprandStatus_t` ### hipCUB + - `hipcub::CountingInputIterator` - `hipcub::DeviceReduce::Sum` - `hipcub::TransformInputIterator` diff --git a/Applications/prefix_sum/README.md b/Applications/prefix_sum/README.md index 5ee106d6..ff65d275 100644 --- a/Applications/prefix_sum/README.md +++ b/Applications/prefix_sum/README.md @@ -1,6 +1,7 @@ # Applications: Prefix Sum Example ## Description + This example showcases a GPU implementation of a prefix sum via a scan algorithm. 
This example does not use the scan or reduce methods from rocPRIM or hipCUB (`hipcub::DeviceScan::ExclusiveScan`), which could provide improved performance. @@ -8,37 +9,53 @@ For each element in the input, prefix sum calculates the sum from the beginning
$a_n = \sum^{n}_{m=0} A[m]$
-The algorithm used has two phases which are repeated: - A) the block wide prefix sum which uses a two pass prefix sum algorithm as described in _Prefix Sums and Their Applications_ (Blelloch, 1988). - B) the device wide prefix sum which propagates values from one block to others. +The algorithm used has two phases which are repeated: +
+ a) the block-wide prefix sum, which uses a two-pass prefix sum algorithm as described in _Prefix Sums and Their Applications_ (Blelloch, 1988).
+
+ b) the device-wide prefix sum, which propagates values from one block to others.
-Below is an example where the threads per block is 2. -In the first iteration ($\text{offset}=1$) we have 4 threads combining 8 items. +Below is an example where the number of threads per block is 2. +In the first iteration ($\text{offset}=1$) we have 4 threads combining 8 items.
![A diagram of the prefix sum algorithm, combining 8 items with 2 threads per block.](prefix_sum_diagram.svg)
### Application flow +
1. Parse user input.
2. Generate input vector.
3. Calculate the prefix sum. - 1. Define the kernel constants. - 2. Declare and allocate device memory. - 3. Copy the input from host to device - 4. Sweep over the input, multiple times if needed. - 5. Copy the results from device to hsot. - 6. Clean up device memory allocations. +
+ a) Define the kernel constants.
+
+ b) Declare and allocate device memory.
+
+ c) Copy the input from host to device.
+
+ d) Sweep over the input, multiple times if needed.
+
+ e) Copy the results from device to host.
+
+ f) Clean up device memory allocations.
+
4. Verify the output.
### Command line interface +
The application has an optional argument: +
- `-n <size>` with the size of the array to run the prefix sum over. The default value is `256`.
### Key APIs and concepts +
- Device memory is managed with `hipMalloc` and `hipFree`. The former sets the pointer to the allocated space and the latter frees this space. +
- `myKernel<<<...>>>()` launches the kernel named `myKernel`. In this example the kernels `block_prefix_sum` and `device_prefix_sum` are launched. `block_prefix_sum` requires shared memory which is passed along in the kernel launch. +
- `extern __shared__ float[]` in the kernel code denotes an array in shared memory which can be accessed by all threads in the same block. +
- `__syncthreads()` blocks this thread until all threads within the current block have reached this point. This is to ensure no unwanted read-after-write, write-after-write, or write-after-read situations occur. @@ -47,6 +64,7 @@ The application has an optional argument:
### HIP runtime
#### Device symbols +
- `blockDim`
- `blockIdx`
- `threadIdx`
- `__shared__`
#### Host symbols +
- `__global__`
- `hipFree()`
- `hipMalloc()`
diff --git a/Dockerfiles/README.md b/Dockerfiles/README.md index cfe347da..b61ee204 100644 --- a/Dockerfiles/README.md +++ b/Dockerfiles/README.md @@ -4,22 +4,27 @@ This folder hosts Dockerfiles with ready-to-use environments for the various sam
Each sample describes which environment it can be used with.
## Building +
From this folder execute
-```
+```bash
docker build . -f <dockerfile-name> -t <image-name>
```
## List of Dockerfiles +
### HIP libraries on the ROCm platform based on Ubuntu +
Dockerfile: [hip-libraries-rocm-ubuntu.Dockerfile](hip-libraries-rocm-ubuntu.Dockerfile)
-This is environment is based on Ubuntu targeting the ROCm platform. It has the HIP runtime and -the ROCm libraries installed. CMake is also installed in the image.
+This environment is based on Ubuntu targeting the ROCm platform. It has the +HIP runtime and the ROCm libraries installed. CMake is also installed in the image.
It can be used with most of the samples when running on a ROCm target.
### HIP libraries on the CUDA platform based on Ubuntu +
Dockerfile: [hip-libraries-cuda-ubuntu.Dockerfile](hip-libraries-cuda-ubuntu.Dockerfile)
-This is environment is based on Ubuntu targeting the CUDA platform. It has the HIP runtime and -the ROCm libraries installed. CMake is also installed in the image.
+This environment is based on Ubuntu targeting the CUDA platform. It has the +HIP runtime and the ROCm libraries installed. CMake is also installed in the image.
It can be used with the samples that support the CUDA target.
diff --git a/Docs/CONTRIBUTING.md b/Docs/CONTRIBUTING.md index 24c5d2a8..89adf47a 100644 --- a/Docs/CONTRIBUTING.md +++ b/Docs/CONTRIBUTING.md @@ -1,33 +1,61 @@ # Guidelines
-To keep the style of the examples consistent, please follow the following guidelines when implementing your example. +
+To keep the style of the examples consistent, please adhere to the following
+guidelines when implementing your example.
## Make/CMake
-Each example has to at least support `CMake` as build system. The simpler examples should also support `Make`.
-Every example has to be able to be built separately from the others, but also has to be added to the top-level build scripts. + +Each example has to at least support `CMake` as build system. +The simpler examples should also support `Make`.
+Every example has to be able to be built separately from the others,
+but also has to be added to the top-level build scripts.
## Code Format
-The formatting rules of the examples are enforced by `clang-format` using the `.clang-format` file in the top-level directory. +
+The formatting rules of the examples are enforced by `clang-format` using the
+`.clang-format` file in the top-level directory.
## Variable Naming Conventions
-- Use `lower_snake_case` style to name variables and functions (e.g. block_size, multiply_kernel and multiply_host). +
+- Use `lower_snake_case` style to name variables and functions (e.g. block_size,
+multiply_kernel and multiply_host).
- Use `PascalCase` for `class`, `struct`, `enum` and template argument definitions.
## File and Directory Naming Conventions +
- Top-level directories use `PascalCase`.
-- The directories in Libraries/ should use the exact name of the library they represent, including casing. If any directory does not represent a library, it should named in `camelCase`. +- The directories in Libraries/ should use the exact name of the library they
+represent, including casing. If any directory does not represent a library, it
+should be named in `camelCase`.
- Directories for individual examples use `snake_case`.
-- Files generally use `snake_case`, with the exception of files for which an existing convention already applies (`README.md`, `LICENSE.md`, `CMakeLists.txt`, etc). +- Files generally use `snake_case`, with the exception of files for which an
+existing convention already applies (`README.md`, `LICENSE.md`, `CMakeLists.txt`,
+ etc.).
-- Example binaries should be prefixed with the library name of the binary, so that there are no conflicts between libraries (e.g. `hipcub_device_sum` and `rocprim_device_sum`). +- Example binaries should be prefixed with the library name of the binary, so
+that there are no conflicts between libraries (e.g. `hipcub_device_sum` and
+`rocprim_device_sum`).
## Utilities
-Utility-functions (printing vectors, etc) and common error-handling code, that is used by all examples, should be moved to the common utility-header [example_utils.hpp](../Common/example_utils.hpp). +
+Utility functions (printing vectors, etc.) and common error-handling code that
+is used by all examples should be moved to the common utility header
+[example_utils.hpp](../Common/example_utils.hpp).
## Error Handling
-Error checking and handling should be applied where appropriate, e.g. when handling user input. `HIP_CHECK` should be used whenever possible. Exceptions should only be used if the complexity of the program requires it.
-In most cases printing an explanation to stderr and terminating the program with an error code, as specified in the common header, is sufficient. + +Error checking and handling should be applied where appropriate, e.g. when +handling user input. `HIP_CHECK` should be used whenever possible. Exceptions +should only be used if the complexity of the program requires it.
+In most cases printing an explanation to stderr and terminating the program with +an error code, as specified in the common header, is sufficient. ## Printing Intermediate Results -Results should be printed when they are helpful for the understanding and showcasing the example. However the output shouldn't be overwhelming, printing a vector with hundreds of entries is usually not useful. + +Results should be printed when they are helpful for the understanding and +showcasing the example. However the output shouldn't be overwhelming, printing +a vector with hundreds of entries is usually not useful. ## .gitignore -A .gitignore file is required in every example subdirectory to exclude the binary generated when using Make. + +A .gitignore file is required in every example subdirectory to exclude the +binary generated when using Make. diff --git a/HIP-Basic/README.md b/HIP-Basic/README.md index 3f79faf0..26881fad 100644 --- a/HIP-Basic/README.md +++ b/HIP-Basic/README.md @@ -1,26 +1,33 @@ # HIP-Basic Examples ## Summary + The examples in this subdirectory showcase the functionality of the HIP runtime. The examples build on Linux for the ROCm (AMD GPU) backend. Some examples additionally support Windows, some examples additionally support the CUDA (NVIDIA GPU) backend. ## Prerequisites + ### Linux + - [CMake](https://cmake.org/download/) (at least version 3.21) - OR GNU Make - available via the distribution's package manager - [ROCm](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.1.3/page/Overview_of_ROCm_Installation_Methods.html) (at least version 5.x.x) ### Windows + - [Visual Studio](https://visualstudio.microsoft.com/) 2019 or 2022 with the "Desktop Development with C++" workload - ROCm toolchain for Windows (No public release yet) - - The Visual Studio ROCm extension needs to be installed to build with the solution files. + - The Visual Studio ROCm extension needs to be installed to build with the solution files. - [CMake](https://cmake.org/download/) (optional, to build with CMake. Requires at least version 3.21) - [Ninja](https://ninja-build.org/) (optional, to build with CMake) ## Building + ### Linux + Make sure that the dependencies are installed, or use one of the [provided Dockerfiles](../../Dockerfiles/) to build and run the examples in a containerized environment. #### Using CMake + All examples in the `HIP-Basic` subdirectory can either be built by a single CMake project or be built independently. - `$ cd Libraries/HIP-Basic` @@ -28,18 +35,22 @@ All examples in the `HIP-Basic` subdirectory can either be built by a single CMa - `$ cmake --build build` #### Using Make + All examples can be built by a single invocation to Make or be built independently. - `$ cd Libraries/HIP-Basic` - `$ make` (on ROCm) or `$ make GPU_RUNTIME=CUDA` (on CUDA, when supported) ### Windows + Not all HIP runtime examples support building on Windows. See the README file in the directory of the example for more details. #### Visual Studio + Visual Studio solution files are available for the individual examples. To build all supported HIP runtime examples open the top level solution file [ROCm-Examples-VS2019.sln](../../ROCm-Examples-VS2019.sln) and filter for HIP-Basic. For more detailed build instructions refer to the top level [README.md](../../README.md#visual-studio). #### CMake + All examples in the `HIP-Basic` subdirectory can either be built by a single CMake project or be built independently. For build instructions refer to the top-level [README.md](../../README.md#cmake-2). 
diff --git a/HIP-Basic/assembly_to_executable/README.md b/HIP-Basic/assembly_to_executable/README.md index 87e287ee..b8db9b98 100644 --- a/HIP-Basic/assembly_to_executable/README.md +++ b/HIP-Basic/assembly_to_executable/README.md @@ -1,6 +1,7 @@ # HIP-Basic Assembly to Executable Example
## Description +
This example shows how to manually compile and link a HIP application from device assembly. Pre-generated assembly files are compiled into an _offload bundle_, a bundle of device object files, and then linked with the host object file to produce the final executable.
Building HIP executables from device assembly can be useful, for example, to experiment with specific instructions, to perform specific optimizations, or to help with debugging. @@ -8,19 +9,25 @@ Building HIP executables from device assembly can be useful for example to exper
### Building
- Build with Makefile: to compile for specific GPU architectures, optionally provide the HIP_ARCHITECTURES variable. Provide the architectures separated by semicolons. +
```shell
make HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102"
```
+
- Build with CMake: +
```shell
cmake -S . -B build -DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102"
cmake --build build
```
+
On Windows, the path to the RC compiler may be needed: `-DCMAKE_RC_COMPILER="C:/Program Files (x86)/Windows Kits/path/to/x64/rc.exe"`
- HIP SDK for window does not support HIP device architecture gfx942. + The HIP SDK for Windows does not support the HIP device architecture gfx942.
## Generating device assembly +
This example builds an executable from pre-generated device assembly files; however, such assembly files can also be created from HIP source code using `hipcc`. This can be done by passing `-S` and `--cuda-device-only` to hipcc. The former flag instructs the compiler to generate human-readable assembly instead of machine code, and the latter instructs the compiler to only compile the device part of the program. The ten assembly files for this example were generated as follows: +
```shell
$ROCM_INSTALL_DIR/bin/hipcc -S --cuda-device-only --offload-arch=gfx803 --offload-arch=gfx900 --offload-arch=gfx906 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx942 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 main.hip
``` @@ -28,9 +35,11 @@ $ROCM_INSTALL_DIR/bin/hipcc -S --cuda-device-only --offload-arch=gfx803 --offloa
The user may modify the `--offload-arch` flag to build for other architectures and choose to either enable or disable extra device code-generation features such as `xnack` or `sram-ecc`, which can be specified as `--offload-arch=<arch>:<feature>+` to enable it or `--offload-arch=<arch>:<feature>-` to disable it. Multiple features may be present, separated by colons.
## Build Process +
A HIP binary consists of a regular host executable, which has an offload bundle containing device code embedded inside it. This offload bundle contains object files for each of the target devices that it is compiled for, and is loaded at runtime to provide the machine code for the current device. A HIP executable can be built from device assembly files and host HIP code according to the following process:
1. The `main.hip` file is compiled to an object file that only contains host code with `hipcc` by using the `--cuda-host-only` option. `main.hip` is a program that launches a simple kernel to compute the square of each element of a vector.
The `-c` option is required to prevent the compiler from creating an executable, and make it create an object file containing the compiled host code instead. + ```shell $ROCM_INSTALL_DIR/bin/hipcc -c --cuda-host-only main.hip ``` @@ -56,6 +65,7 @@ A HIP binary consists of a regular host executable, which has an offload bundle Note: using -bundle-align=4096 only works on ROCm 4.0 and newer compilers. Also, the architecture must match the same `--offload-arch` as when compiling to assembly. 4. The offload bundle is embedded inside an object file that can be linked with the object file containing the host code. The offload bundle must be placed in the `.hip_fatbin` section, and must be placed after the symbol `__hip_fatbin`. This can be done by creating an assembly file that places the offload bundle in the appropriate section using the `.incbin` directive: + ```nasm .type __hip_fatbin,@object ; Tell the assembler to place the offload bundle in the appropriate section. @@ -68,31 +78,41 @@ A HIP binary consists of a regular host executable, which has an offload bundle ; Include the binary .incbin "offload_bundle.hipfb" ``` + This file can then be assembled using `llvm-mc` as follows: - ``` + + ```shell $ROCM_INSTALL_DIR/llvm/bin/llvm-mc -triple -o main_device.o hip_obj_gen.mcin --filetype=obj ``` 5. Finally, using the system linker, hipcc, or clang, the host object and device objects are linked into an executable: + ```shell /hip/bin/hipcc -o hip_assembly_to_executable main.o main_device.o ``` ### Visual Studio 2019 + The above compilation steps are implemented in Visual Studio through Custom Build Steps and Custom Build Tools: + - The host compilation from step 1 is performed by adding extra options to the source file, under `main.hip -> properties -> C/C++ -> Command Line`: - ``` + + ```shell Additional Options: --cuda-host-only ``` + - Each device assembly .s file has a custom build tool associated to it, which performs the operation associated to step 2 from the previous section: - ``` + + ```shell Command Line: "$(ClangToolPath)clang++" -o "$(IntDir)%(FileName).o" "%(Identity)" -target amdgcn-amd-amdhsa -mcpu=gfx90a Description: Compiling Device Assembly %(Identity) Output: $(IntDir)%(FileName).o Execute Before: ClCompile ``` + - Steps 3 and 4 are implemented using a custom build step: - ``` + + ```shell Command Line: "$(ClangToolPath)clang-offload-bundler" -type=o -bundle-align=4096 -targets=host-x86_64-pc-windows-msvc,hipv4-amdgcn-amd-amdhsa--gfx803,hipv4-amdgcn-amd-amdhsa--gfx900,hipv4-amdgcn-amd-amdhsa--gfx906,hipv4-amdgcn-amd-amdhsa--gfx908,hipv4-amdgcn-amd-amdhsa--gfx90a,hipv4-amdgcn-amd-amdhsa--gfx1030,hipv4-amdgcn-amd-amdhsa--gfx1100,hipv4-amdgcn-amd-amdhsa--gfx1101,hipv4-amdgcn-amd-amdhsa--gfx1102 -input=nul "-input=$(IntDir)main_gfx803.o" "-input=$(IntDir)main_gfx900.o" "-input=$(IntDir)main_gfx906.o" "-input=$(IntDir)main_gfx908.o" "-input=$(IntDir)main_gfx90a.o" "-input=$(IntDir)main_gfx1030.o" "-input=$(IntDir)main_gfx1100.o" "-input=$(IntDir)main_gfx1101.o" "-input=$(IntDir)main_gfx1102.o" "-output=$(IntDir)offload_bundle.hipfb" cd $(IntDir) && "$(ClangToolPath)llvm-mc" -triple host-x86_64-pc-windows-msvc "hip_obj_gen_win.mcin" -o "main_device.obj" --filetype=obj @@ -101,8 +121,10 @@ The above compilation steps are implemented in Visual Studio through Custom Buil Additional Dependencies: 
$(IntDir)main_gfx803.o;$(IntDir)main_gfx900.o;$(IntDir)main_gfx906.o;$(IntDir)main_gfx908.o;$(IntDir)main_gfx90a.o;$(IntDir)main_gfx1030.o;$(IntDir)main_gfx1100.o;$(IntDir)main_gfx1101.o;$(IntDir)main_gfx1102.o;$(IntDir)hip_objgen_win.mcin;%(Inputs)
Execute Before: ClCompile
```
+
- Finally step 5 is implemented by passing additional inputs to the linker in `project -> properties -> Linker -> Input`: -
```
+```shell
Additional Dependencies: $(IntDir)main_device.obj;%(AdditionalDependencies)
``` @@ -116,7 +138,9 @@ This example depends on the following tools:
`rocm-llvm` is installed with most ROCm installations.
## Used API surface +
### HIP runtime +
- `hipFree`
- `hipGetDeviceProperties`
- `hipGetLastError`
diff --git a/HIP-Basic/bandwidth/README.md b/HIP-Basic/bandwidth/README.md index 31bbba35..909f6637 100644 --- a/HIP-Basic/bandwidth/README.md +++ b/HIP-Basic/bandwidth/README.md @@ -1,22 +1,27 @@ # Cookbook Bandwidth Example
## Description +
This example measures the memory bandwidth capacity of GPU devices. It performs memcpy from host to GPU device, GPU device to host, and within a single GPU.
-### Application flow +### Application flow +
1. User command-line arguments are parsed and test parameters initialized. If there are no command-line arguments then the test parameters are initialized with default values.
2. Bandwidth tests are launched.
3. If the memory type for the test is set to `-memory pageable` then the host-side data is instantiated in `std::vector`. If the memory type for the test is set to `-memory pinned` then the host-side data is instantiated in `unsigned char*` and allocated using `hipHostMalloc`.
4. Device-side storage is allocated using `hipMalloc` in `unsigned char*`.
5. Memory transfer is performed `trail` times using `hipMemcpy` for pageable memory or using `hipMemcpyAsync` for host-allocated pinned memory.
6. The time taken by the memory transfer operations is measured and then used to calculate the bandwidth.
-9. All device memory is freed using `hipFree` and all host allocated pinned memory is freed using `hipHostFree`.
+7. All device memory is freed using `hipFree` and all host-allocated pinned memory is freed using `hipHostFree`.
## Key APIs and Concepts +
The program uses HIP pageable and pinned memory. It is important to note that the pinned memory is allocated using `hipHostMalloc` and is destroyed using `hipHostFree`. The HIP memory transfer routine `hipMemcpyAsync` will behave synchronously if the host memory is not pinned. Therefore, it is important to allocate pinned host memory using `hipHostMalloc` for `hipMemcpyAsync` to behave asynchronously.
## Demonstrated API Calls +
### HIP runtime +
- `hipMalloc`
- `hipMemcpy`
- `hipMemcpyAsync`
diff --git a/HIP-Basic/bit_extract/README.md b/HIP-Basic/bit_extract/README.md index a64fdd21..88079fca 100644 --- a/HIP-Basic/bit_extract/README.md +++ b/HIP-Basic/bit_extract/README.md @@ -1,9 +1,11 @@ # HIP-Basic Bit Extract Example
## Description +
A HIP-specific bit extract solution is presented in this example.
-### Application flow +### Application flow +
1. Allocate memory for host vectors.
2. Fill the input host vector as an arithmetic sequence by the vector index.
3. Allocate memory for device arrays. @@ -15,16 +17,21 @@ A HIP-specific bit extract solution is presented in this example.
9. "PASSED!" is printed when the flow was successful.
## Key APIs and Concepts +
- `kernel_name<<<grid_dim, block_dim, dynamic_shared_memory_size, stream>>>()` is the HIP kernel launcher, where the grid and block dimensions, the dynamic shared memory size and the HIP stream are defined.
We use NULL stream in the recent example. - `__bitextract_u32(source, bit_start, num_bits)` is the built-in AMD HIP bit extract operator, where we define a source scalar, a `bit_start` start bit and a `num_bits` number of extraction bits. The operator returns with a scalar value. ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `threadIdx`, `blockIdx`, `blockDim`, `gridDim` - `__bitextract_u32` #### Host symbols + - `hipMalloc` - `hipFree` - `hipMemcpy` diff --git a/HIP-Basic/cooperative_groups/README.md b/HIP-Basic/cooperative_groups/README.md index 4714bc71..3c45005b 100644 --- a/HIP-Basic/cooperative_groups/README.md +++ b/HIP-Basic/cooperative_groups/README.md @@ -1,7 +1,8 @@ # HIP-Basic Cooperative Groups Example ## Description -This program showcases the usage of Cooperative Groups inside a reduction kernel. + +This program showcases the usage of Cooperative Groups inside a reduction kernel. Cooperative groups can be used to gain more control over synchronization. @@ -9,6 +10,7 @@ For more insights, you can read the following blog post: [Cooperative Groups: Flexible CUDA Thread Programming](https://developer.nvidia.com/blog/cooperative-groups/) ### Application flow + 1. A number of variables are defined to control the problem details and the kernel launch parameters. 2. Input vector is set up in host memory. 3. The input is copied to the device. @@ -18,19 +20,25 @@ For more insights, you can read the following blog post: 7. The elements of the result vectors are compared with the expected result. The result of the comparison is printed to the standard output. ## Key APIs and Concepts -Usually, programmers can only synchronize on warp-level or block-level. -But cooperative groups allows the programmer to partition threads together and subsequently synchronize those groups. + +Usually, programmers can only synchronize on warp-level or block-level. +But cooperative groups allows the programmer to partition threads together and subsequently synchronize those groups. The partitioned threads can reside across multiple devices. ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `thread_group` - `thread_block` - `tiled_partition()` -- `thread_block_tile` +- `thread_block_tile` - All above from the [`cooperative_groups` namespace](https://github.com/ROCm-Developer-Tools/hipamd/blob/develop/include/hip/amd_detail/amd_hip_cooperative_groups.h) + #### Host symbols + - `hipMalloc` - `hipMemcpy` - `hipLaunchCooperativeKernel` diff --git a/HIP-Basic/device_globals/README.md b/HIP-Basic/device_globals/README.md index f65b9463..0ffc5edd 100644 --- a/HIP-Basic/device_globals/README.md +++ b/HIP-Basic/device_globals/README.md @@ -1,9 +1,11 @@ # HIP-Basic Device Globals Example ## Description + This program showcases a simple example that uses device global variables to perform a simple test kernel. Two such global variables are set using different methods: one is a single variable is set by first obtaining a pointer to it and using `hipMemcpy`, as would be done for a pointer to device memory using `hipMalloc`. The other is an array that is initialized without first explicitly obtaining the pointer by using `hipMemcpyToSymbol`. ### Application flow + 1. A number of constants are defined for the kernel launch parameters. 2. The input and output vectors are initialized in host memory. 3. The necessary amount of device memory for the input and output vectors is allocated and the input data is copied to the device. 
@@ -14,7 +16,9 @@ This program showcases a simple example that uses device global variables to per
8. The results are copied back to the host.
9. Device memory backing the input and output vectors is freed.
10. A reference computation is performed on the host and the results are compared with the expected result. The result of the comparison is printed to standard output.
+
## Key APIs and Concepts +
Apart from kernel parameters, values can also be passed to the device via _device global variables_: global variables that have the `__device__` attribute. These can be used from device kernels, and need to be initialized from the host before they hold a valid value. Device global variables are persistent between kernel launches, so they can also be used to communicate values between launches without explicitly managing a buffer for them on the host.
A device global variable cannot be used as a regular global variable from the host side. To manage them, a pointer to the device memory that they represent needs to be obtained first. This can be done using the functions `hipGetSymbolAddress(dev_ptr, symbol)` and `hipGetSymbolSize(dev_ptr, symbol)`. A device global variable can be passed directly to these functions by using the `HIP_SYMBOL(symbol)` macro. The resulting device pointer can be used in the same ways as memory obtained from `hipMalloc`, and so the corresponding value can be set by using `hipMemcpy`. @@ -22,8 +26,11 @@ A device global variable cannot be used as a regular global variable from the ho
Device global variables may also be initialized directly by using the `hipMemcpyToSymbol(symbol, host_source, size_bytes, offset = 0, kind = hipMemcpyHostToDevice)`. This method omits having to fetch the pointer to the device global variable explicitly. Similarly, `hipMemcpyFromSymbol(host_dest, symbol, size_bytes, offset = 0, kind = hipMemcpyDeviceToHost)` can be used to copy from a device global variable back to the host.
## Demonstrated API Calls +
### HIP runtime +
#### Device symbols +
- `__global__`
- `__device__`
- `threadIdx`
- `blockDim`
- `blockIdx`
#### Host symbols +
- `hipFree`
- `hipGetLastError`
- `hipGetSymbolAddress`
diff --git a/HIP-Basic/device_query/README.md b/HIP-Basic/device_query/README.md index f537e598..ccc143a7 100644 --- a/HIP-Basic/device_query/README.md +++ b/HIP-Basic/device_query/README.md @@ -1,23 +1,29 @@ # HIP-Basic Device Query Example
## Description +
This example shows how the target platform and compiler can be identified, as well as how properties from the device may be queried.
-### Application flow -1. Using compiler-defined macros, the target platform and compiler are identified. -1. The number of devices in the system is queried, and for each device: 1. The device is set as the active device. - 1. The device properties are queried and a selected set is printed. - 1. For each device in the system, it is queried and printed whether this device can access its memory. - 1. If NVIDIA is the target platform, some NVIDIA-specific device properties are printed. - 1. The amount of total and free memory of the device is queried and printed. +### Application flow +
+1. Using compiler-defined macros, the target platform and compiler are identified.
+2. The number of devices in the system is queried, and for each device: 1. The device is set as the active device.
+ 2. The device properties are queried and a selected set is printed.
+ 3.
For each device in the system, it is queried and printed whether this device can access its memory.
+ 4. If NVIDIA is the target platform, some NVIDIA-specific device properties are printed.
+ 5. The amount of total and free memory of the device is queried and printed.
## Key APIs and Concepts +
- HIP code can target the AMD and the NVIDIA platform, and it can be compiled with different compilers. Compiler-defined macros can be used in HIP code to write code that is specific to a target or a compiler. See [HIP Programming Guide - Distinguishing Compiler Modes](https://docs.amd.com/bundle/HIP-Programming-Guide-v5.2/page/Transitioning_from_CUDA_to_HIP.html#d4438e664) for more details. +
- `hipGetDeviceCount` returns the number of devices in the system. Some device management API functions take an identifier for each device, which is a monotonically incrementing number starting from zero. Others require the active device to be set, with `hipSetDevice`. A full overview of the device management API can be found at [HIP API - Device Management](https://docs.amd.com/bundle/HIP_API_Guide/page/group___device.html).
## Demonstrated API Calls +
### HIP Runtime +
- `__HIP_PLATFORM_AMD__`
- `__HIP_PLATFORM_NVIDIA__`
- `__CUDACC__`
diff --git a/HIP-Basic/dynamic_shared/README.md b/HIP-Basic/dynamic_shared/README.md index 42284600..ab9dd6b2 100644 --- a/HIP-Basic/dynamic_shared/README.md +++ b/HIP-Basic/dynamic_shared/README.md @@ -1,9 +1,11 @@ # HIP-Basic Dynamic Shared Memory Example
## Description +
This program showcases an implementation of a simple matrix transpose kernel, which uses shared memory that is dynamically allocated at runtime.
### Application flow +
1. A number of constants are defined to control the problem details and the kernel launch parameters.
2. Input matrix is set up in host memory.
3. The necessary amount of device memory is allocated and input is copied to the device. @@ -13,25 +15,33 @@ This program showcases an implementation of a simple matrix tranpose kernel, whi
7. The elements of the result matrix are compared with the expected result. The result of the comparison is printed to the standard output.
## Key APIs and Concepts +
Global memory is the main memory on a GPU. This memory is used when transferring data between host and device. It has a large capacity, but also has a relatively high latency to access, which limits the performance of parallel programs. To help mitigate the effects of global memory latency, each GPU multiprocessor is equipped with a local amount of _shared_ memory. Shared memory is accessible by all threads in a multiprocessor and is typically much faster than using global memory. Each multiprocessor on a GPU has a fixed amount of shared memory, typically between 32 and 64 kilobytes. In HIP code, variables can be declared to be placed in shared memory by using the `__shared__` attribute.
A GPU multiprocessor can process multiple blocks of a kernel invocation simultaneously. In order to allocate shared memory for each block, the GPU runtime needs to know the total shared memory that each kernel can use, so that it can calculate how many blocks can run at the same time. When declaring shared variables of which the size is known at compile time, the compiler computes the total size automatically. Sometimes, however, this size may not be known in advance, for example when the required amount of shared memory depends on the input size.
In these cases, it is not beneficial to declare an upper bound, as this may unnecessarily limit the number of blocks that can be processed at the same time. In these situations _dynamic shared memory_ can be used. This is an amount of shared memory of which the size may be given at runtime. Dynamic shared memory is used by declaring an `extern` shared variable of a variable-length array of unspecified size: + ```c++ extern __shared__ type var[]; ``` The GPU runtime still needs to know the total amount of shared memory that a kernel will use, and for this reason this value needs to be passed with the execution configuration when launching the kernel. When using the `myKernelName<<<...>>>` kernel launch syntax, this is simply a parameter that indicates the required amount: + ```c++ kernel_name<<>>(); ``` ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `threadIdx`, `blockIdx`, `blockDim` - `__shared__` - `__syncthreads` + #### Host symbols + - `hipMalloc` - `hipMemcpy` - `hipGetLastError` diff --git a/HIP-Basic/events/README.md b/HIP-Basic/events/README.md index 607eaa11..788abfb7 100644 --- a/HIP-Basic/events/README.md +++ b/HIP-Basic/events/README.md @@ -1,10 +1,13 @@ # HIP-Basic Events Example + ## Description + Memory transfer and kernel execution are the most important parameters in parallel computing, especially in high performance computing (HPC) and machine learning. Memory bottlenecks are the main problem why we are not able to get the highest performance, therefore obtaining the memory transfer timing and kernel execution timing plays key role in application optimization. This example showcases measuring kernel and memory transfer timing using HIP events. The kernel under measurement is a trivial one that performs square matrix transposition. -### Application flow +### Application flow + 1. A number of parameters are defined that control the problem details and the kernel launch. 2. Input data is set up in host memory. 3. The necessary amount of device memory is allocated. @@ -18,17 +21,22 @@ This example showcases measuring kernel and memory transfer timing using HIP eve 11. The result data is validated by comparing it to the product of the reference (host) implementation. The result of the validation is printed to the standard output. ## Key APIs and Concepts + - The `hipEvent_t` type defines HIP events that can be used for synchronization and time measurement. The events must be initialized using `hipEventCreate` before usage and destroyed using `hipEventDestroy` after they are no longer needed. - The events have to be queued on a device stream in order to be useful, this is done via the `hipEventRecord` function. The stream itself is a list of jobs (memory transfers, kernel executions and events) that execute sequentially. When the event is processed by the stream, the current machine time is recorded to the event. This can be used to measure execution times on the stream. In this example, the default stream is used. - The time difference between two recorded events can be accessed using the function `hipEventElapsedTime`. - An event can be used to synchronize the execution of the jobs on a stream with the execution of the host. A call to `hipEventSynchronize` blocks the host until the provided event is scheduled on its stream. 
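To make the above concrete, the following sketch times a host-to-device copy with a pair of events on the default stream. The buffer handling is illustrative, and `HIP_CHECK` is assumed to come from the common example utilities; the same two-event pattern brackets the kernel launch in this example, only the work between the `hipEventRecord` calls changes.

```cpp
#include <hip/hip_runtime.h>

// Returns the time in milliseconds that a host-to-device copy took, measured
// with two events recorded on the default stream.
float time_host_to_device_copy(void* d_dst, const void* h_src, size_t size_bytes)
{
    hipEvent_t start, stop;
    HIP_CHECK(hipEventCreate(&start));
    HIP_CHECK(hipEventCreate(&stop));

    HIP_CHECK(hipEventRecord(start, hipStreamDefault));
    HIP_CHECK(hipMemcpy(d_dst, h_src, size_bytes, hipMemcpyHostToDevice));
    HIP_CHECK(hipEventRecord(stop, hipStreamDefault));
    HIP_CHECK(hipEventSynchronize(stop)); // block the host until 'stop' is processed

    float elapsed_ms{};
    HIP_CHECK(hipEventElapsedTime(&elapsed_ms, start, stop));

    HIP_CHECK(hipEventDestroy(start));
    HIP_CHECK(hipEventDestroy(stop));
    return elapsed_ms;
}
```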
## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `threadIdx`, `blockIdx`, `blockDim` #### Host symbols + - `hipMalloc` - `hipFree` - `hipMemcpy` diff --git a/HIP-Basic/gpu_arch/README.md b/HIP-Basic/gpu_arch/README.md index cb14e92b..3defdc48 100644 --- a/HIP-Basic/gpu_arch/README.md +++ b/HIP-Basic/gpu_arch/README.md @@ -1,9 +1,11 @@ # HIP-Basic GPU Architecture-specific Code Example ## Description + This program showcases an implementation of a simple matrix transpose kernel, which uses a different codepath depending on the target architecture. ### Application flow + 1. A number of constants are defined to control the problem details and the kernel launch parameters. 2. Input matrix is set up in host memory. 3. The necessary amount of device memory is allocated and input is copied to the device. @@ -13,6 +15,7 @@ This program showcases an implementation of a simple matrix transpose kernel, wh 7. The elements of the result matrix are compared with the expected result. The result of the comparison is printed to the standard output. ## Key APIs and Concepts + This example showcases two different codepaths inside a GPU kernel, depending on the target architecture. You may want to use architecture-specific inline assembly when compiling for a specific architecture, without losing compatibility with other architectures (see the [inline_assembly](/HIP-Basic/inline_assembly/main.hip) example). @@ -20,11 +23,16 @@ You may want to use architecture-specific inline assembly when compiling for a s These architecture-specific compiler definitions only exist within GPU kernels. If you would like to have GPU architecture-specific host-side code, you could query the stream/device information at runtime. ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `threadIdx`, `blockIdx`, `blockDim` - `__gfx1010__`, `__gfx1011__`, `__gfx1012__`, `__gfx1030__`, `__gfx1031__`, `__gfx1100__`, `__gfx1101__`, `__gfx1102__` + #### Host symbols + - `hipMalloc` - `hipMemcpy` - `hipGetLastError` diff --git a/HIP-Basic/hello_world/README.md b/HIP-Basic/hello_world/README.md index 7b6a8595..204efd99 100644 --- a/HIP-Basic/hello_world/README.md +++ b/HIP-Basic/hello_world/README.md @@ -1,27 +1,38 @@ # HIP-Basic Hello World Example ## Description + This example showcases launching kernels and printing from device programs. -### Application flow +### Application flow + 1. A kernel is launched: function `hello_world_kernel` is executed on the device. This function uses the coordinate built-ins to print a unique identifier from each thread. 2. _Synchronization_ is performed: the host program execution halts until all kernels on the device have finished executing. ## Key APIs and Concepts + - `myKernelName<<<gridDim, blockDim, dynamicShared, stream>>>(kernelArguments)` launches a kernel. In other words: it calls a function marked with `__global__` to execute on the device. An _execution configuration_ is specified, which are the grid and block dimensions, the amount of additional shared memory to allocate, and the stream where the kernel should execute. Optionally, the kernel function may take arguments as well. + - `hipDeviceSynchronize` synchronizes with the device, halting the host until all commands associated with the device have finished executing. + - Printing from device functions is performed using `printf`. -- Function-type qualifiers are used to indicate the type of a function. - - `__global__` functions are executed on the device and called from the host. 
- - `__device__` functions are executed on the device and called from the device only. - - `__host__` functions are executed on the host and called from the host. - - Functions marked with both `__device__` and `__host__` are compiled for host and device. This means that these functions cannot contain any device- or host-specific code. -- Coordinate built-ins determine the coordinate of the active work item in the execution grid. - - `threadIdx` is the 3D coordinate of the active work item in the block of threads. - - `blockIdx` is the 3D coordinate of the active work item in the grid of blocks. + +- Function-type qualifiers are used to indicate the type of a function. + + - `__global__` functions are executed on the device and called from the host. + - `__device__` functions are executed on the device and called from the device only. + - `__host__` functions are executed on the host and called from the host. + - Functions marked with both `__device__` and `__host__` are compiled for host and device. This means that these functions cannot contain any device- or host-specific code. + +- Coordinate built-ins determine the coordinate of the active work item in the execution grid. + + - `threadIdx` is the 3D coordinate of the active work item in the block of threads. + - `blockIdx` is the 3D coordinate of the active work item in the grid of blocks. ## Demonstrated API Calls + ### HIP Runtime + - `hipDeviceSynchronize` - `__device__` - `__global__` @@ -30,4 +41,5 @@ This example showcases launching kernels and printing from device programs. - `threadIdx` - `blockIdx` ## Supported Platforms + Windows is currently not supported by the hello world example, due to a driver failure with `printf` from device code. diff --git a/HIP-Basic/hipify/README.md b/HIP-Basic/hipify/README.md index 53069044..e1d30475 100644 --- a/HIP-Basic/hipify/README.md +++ b/HIP-Basic/hipify/README.md @@ -1,17 +1,23 @@ # HIP-Basic Hipify Example ## Description + The hipify example demonstrates the use of the HIP utility `hipify-perl` to port CUDA code to HIP. It converts CUDA `.cu` source code into portable HIP `.hip` source code that can be compiled using `hipcc` and executed on any supported GPU (AMD or NVIDIA). -### Application flow + +### Application flow + 1. The build system (either `cmake` or `Makefile`) first converts the `main.cu` source code into a HIP portable `main.hip` source code. It uses the `hipify-perl main.cu > main.hip` command to achieve the conversion. 2. `main.hip` is then compiled using `hipcc main.hip -o hip_hipify` to generate the executable file. 3. The executable program launches a simple kernel that computes the square of each element of a vector. ## Key APIs and Concepts + `hipify-perl` is a utility that converts CUDA `.cu` source code into HIP portable code. It parses CUDA files and produces the equivalent HIP portable `.hip` source file. ## Used API surface + ### HIP runtime + - `hipGetErrorString` - `hipGetDeviceProperties` - `hipMalloc` diff --git a/HIP-Basic/inline_assembly/README.md b/HIP-Basic/inline_assembly/README.md index 3f427721..55d04ada 100644 --- a/HIP-Basic/inline_assembly/README.md +++ b/HIP-Basic/inline_assembly/README.md @@ -1,11 +1,11 @@ # HIP-Basic Inline Assembly Example ## Description -This program showcases an implementation of a simple matrix transpose kernel, which uses inline assembly and works on both AMD and NVIDIA hardware. -By using inline assembly in your kernels, you may be able to gain extra performance. 
-It could also enable you to use special GPU hardware features which are not available through compiler intrinsics. +This program showcases an implementation of a simple matrix transpose kernel, which uses inline assembly and works on both AMD and NVIDIA hardware. +By using inline assembly in your kernels, you may be able to gain extra performance. +It could also enable you to use special GPU hardware features which are not available through compiler intrinsics. For more insights, please read the following blogs by Ben Sander: [The Art of AMDGCN Assembly: How to Bend the Machine to Your Will](https://gpuopen.com/learn/amdgcn-assembly/) & @@ -15,8 +15,8 @@ For more information: [AMD ISA documentation for current architectures](https://gpuopen.com/amd-isa-documentation/) & [User Guide for LLVM AMDGPU Back-end](https://llvm.org/docs/AMDGPUUsage.html) - ### Application flow + 1. A number of variables are defined to control the problem details and the kernel launch parameters. 2. Input matrix is set up in host memory. 3. The necessary amount of device memory is allocated and input is copied to the device. @@ -26,6 +26,7 @@ For more information: 7. The elements of the result matrix are compared with the expected result. The result of the comparison is printed to the standard output. ## Key APIs and Concepts + Using inline assembly in GPU kernels is somewhat similar to using inline assembly in host-side code. The `volatile` statement tells the compiler not to remove the assembly statement during optimizations. ```c++ @@ -35,11 +36,16 @@ asm volatile("v_mov_b32_e32 %0, %1" : "=v"(variable_0) : "v"(variable_1)) ``` However, since the instruction set differs between GPU architectures, you usually want to use the appropriate GPU architecture compiler defines to support multiple architectures (see the [gpu_arch](/HIP-Basic/gpu_arch/main.hip) example for more fine-grained architecture control). ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `threadIdx`, `blockIdx`, `blockDim` - `__HIP_PLATFORM_AMD__`, `__HIP_PLATFORM_NVIDIA__` + #### Host symbols + - `hipMalloc` - `hipMemcpy` - `hipGetLastError` diff --git a/HIP-Basic/llvm_ir_to_executable/README.md b/HIP-Basic/llvm_ir_to_executable/README.md index acdc53ee..4d0cfdff 100644 --- a/HIP-Basic/llvm_ir_to_executable/README.md +++ b/HIP-Basic/llvm_ir_to_executable/README.md @@ -1,6 +1,7 @@ # HIP-Basic LLVM-IR to Executable Example ## Description + This example shows how to manually compile and link a HIP application from device LLVM IR. Pre-generated LLVM-IR files are compiled into an _offload bundle_, a bundle of device object files, and then linked with the host object file to produce the final executable. LLVM IR is the intermediate language used by the LLVM compiler, which hipcc is built on. Building HIP executables from LLVM IR can be useful for example to experiment with specific LLVM instructions, or to help debug miscompilations. @@ -8,30 +9,39 @@ LLVM IR is the intermediary language used by the LLVM compiler, which hipcc is b ### Building - Build with Makefile: to compile for specific GPU architectures, optionally provide the HIP_ARCHITECTURES variable. Provide the architectures separated by semicolons. + ```shell make HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx1102" ``` + - Build with CMake: + ```shell cmake -S . 
-B build -DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx1102" cmake --build build ``` + On Windows the path to RC compiler may be needed: `-DCMAKE_RC_COMPILER="C:/Program Files (x86)/Windows Kits/path/to/x64/rc.exe"` ## Generating device LLVM IR + In this example, a HIP executable is compiled from device LLVM IR code. While LLVM IR can be written completely manually, it is not advisable to do so because it is unstable between LLVM versions. Instead, in this example it is generated from `main.hip`, using the following commands: + ```shell $ROCM_INSTALL_DIR/bin/hipcc --cuda-device-only -c -emit-llvm ./main.hip --offload-arch=<arch> -o main_<arch>.bc -I ../../Common -std=c++17 $ROCM_INSTALL_DIR/llvm/bin/llvm-dis main_<arch>.bc -o main_<arch>.ll ``` + Where `<arch>` is the architecture to generate the LLVM IR for. Note that the `--cuda-device-only` flag is required to instruct `hipcc` to only generate LLVM IR for the device part of the computation, and `-c` is required to prevent the compiler from linking the outputs into an executable. In the case of this example, the LLVM IR files were generated using architectures `gfx803`, `gfx900`, `gfx906`, `gfx908`, `gfx90a`, `gfx1030`, `gfx1100`, `gfx1101`, `gfx1102`. The user may modify the `--offload-arch` flag to build for other architectures and choose to either enable or disable extra device code-generation features such as `xnack` or `sram-ecc`, which can be specified as `--offload-arch=<arch>:<feature>+` to enable it or `--offload-arch=<arch>:<feature>-` to disable it. Multiple features may be present, separated by colons. The first of these two commands generates a _bitcode_ module: this is a binary-encoded version of LLVM IR. The second command, using `llvm-dis`, disassembles the bitcode module into textual LLVM IR. ## Build Process + A HIP binary consists of a regular host executable, which has an offload bundle containing device code embedded inside it. This offload bundle contains object files for each of the target devices that it is compiled for, and is loaded at runtime to provide the machine code for the current device. A HIP executable can be built from device LLVM IR and host HIP code according to the following process: 1. The `main.hip` file is compiled to an object file with `hipcc` that only contains host code by using the `--cuda-host-only` option. `main.hip` is a program that launches a simple kernel to compute the square of each element of a vector. The `-c` option is required to prevent the compiler from creating an executable, and make it create an object file containing the compiled host code instead. + ```shell $ROCM_INSTALL_DIR/bin/hipcc -c --cuda-host-only main.hip ``` @@ -57,6 +67,7 @@ A HIP binary consists of a regular host executable, which has an offload bundle Note: using -bundle-align=4096 only works on ROCm 4.0 and newer compilers. Also, the architecture must match the same `--offload-arch` as when compiling the source to LLVM bitcode. 4. The offload bundle is embedded inside an object file that can be linked with the object file containing the host code. The offload bundle must be placed in the `.hip_fatbin` section, and must be placed after the symbol `__hip_fatbin`. This can be done by creating an assembly file that places the offload bundle in the appropriate section using the `.incbin` directive: + ```nasm .type __hip_fatbin,@object ; Tell the assembler to place the offload bundle in the appropriate section. 
@@ -69,24 +80,32 @@ A HIP binary consists of a regular host executable, which has an offload bundle ; Include the binary .incbin "offload_bundle.hipfb" ``` + This file can then be assembled using `llvm-mc` as follows: + ```shell $ROCM_INSTALL_DIR/llvm/bin/llvm-mc -triple <target triple> -o main_device.o hip_obj_gen.mcin --filetype=obj ``` 5. Finally, using the system linker, `hipcc`, or `clang`, the host object and device objects are linked into an executable: + ```shell $ROCM_INSTALL_DIR/hip/bin/hipcc -o hip_llvm_ir_to_executable main.o main_device.o ``` ### Visual Studio 2019 + The above compilation steps are implemented in Visual Studio through Custom Build Steps: + - Specifying that only host compilation should be done, is achieved by adding extra options to the source file, under `main.hip -> properties -> C/C++ -> Command Line`: - ``` + + ```shell Additional Options: --cuda-host-only ``` + - Specifying how the LLVM IR and the offload bundle are generated, is done with a custom build step: - ``` + + ```shell Command Line: FOR %%a in ($(OffloadArch)) DO "$(ClangToolPath)clang++" --cuda-device-only -c -emit-llvm main.hip --offload-arch=%%a -o "$(IntDir)main_%%a.bc" -I ../../Common -std=c++17 FOR %%a in ($(OffloadArch)) DO "$(ClangToolPath)llvm-dis" "$(IntDir)main_%%a.bc" -o "$(IntDir)main_%%a.ll" @@ -102,8 +121,10 @@ The above compilation steps are implemented in Visual Studio through Custom Buil Additional Dependencies: main.hip;$(IntDir)hip_obj_gen_win.mcin Execute Before: ClCompile ``` + - Finally, the linking step is described by passing additional inputs to the linker in `project -> properties -> Linker -> Input`: - ``` + + ```shell Additional Dependencies: $(IntDir)main_device.obj;%(AdditionalDependencies) ``` @@ -118,7 +139,9 @@ This example depends on the following tools: `rocm-llvm` is installed with most ROCm installations. ## Used API surface + ### HIP runtime + - `hipFree` - `hipGetDeviceProperties` - `hipGetLastError` diff --git a/HIP-Basic/matrix_multiplication/README.md b/HIP-Basic/matrix_multiplication/README.md index 98ddf0ff..f76b9d82 100644 --- a/HIP-Basic/matrix_multiplication/README.md +++ b/HIP-Basic/matrix_multiplication/README.md @@ -1,9 +1,11 @@ # HIP-Basic Matrix Multiplication Example ## Description + This example showcases the multiplication of two dynamically sized two-dimensional matrices on the GPU ($\mathrm{A \cdot B=C}$). The sizes of the matrices can be provided on the command line, however the sizes must be multiples of the hard-coded block size, which is 16x16. This implementation is not aimed at best performance or best generality, although some optimizations, such as the utilization of shared memory, are in place. -### Application flow +### Application flow + 1. Default values for dimensions of matrix $\mathrm{A}$ and the number of columns of matrix $\mathrm{B}$ are set. 2. Command line arguments are parsed (if any) and the matrix dimensions are updated. If the command line arguments do not match the specification, an error message is printed to the standard output and the program terminates with a non-zero exit code. 3. Host memory is allocated for the matrices $\mathrm{A}$, $\mathrm{B}$ and $\mathrm{C}$ (using `std::vector`) and the elements of both $\mathrm{A}$ and $\mathrm{B}$ are set to two different constant values. @@ -12,29 +14,40 @@ This example showcases the multiplication of two dynamically sized two-dimension 6. The elements of the resulting matrix $\mathrm{C}$ are copied to the host and all device memory is freed. 7. 
The elements of $\mathrm{C}$ are compared with the expected result. The result of the comparison is printed to the standard output. -### Command line interface +## Command line interface + - If no command line argument is provided, the default matrix sizes are used. + - Otherwise, exactly 3 arguments must be provided. All must be positive integers which are multiples of the block size (16). The order of the arguments is the following: rows of $\mathrm{A}$, columns of $\mathrm{A}$, columns of $\mathrm{B}$. Notice that rows of $\mathrm{B}$ cannot be specified, as it must match the columns of $\mathrm{A}$. ## Key APIs and Concepts + - The kernel implemented in this example performs a matrix multiplication over dynamically sized matrices. The value of $\mathrm{C}$ at row $i$ and column $j$ is calculated with the following formula (where $N$ equals the number of columns of $\mathrm{A}$ and rows of $\mathrm{B}$): $$c_{ij}=\sum_{k=1}^{N}a_{ik}b_{kj}$$ - The kernel is launched in a two-dimensional grid in which each thread is responsible for calculating a single element of the resulting matrix. The threads are organized into 16x16 blocks. Since each block is executed on a single compute unit of the GPU hardware, data can be exchanged between these threads via shared memory. + - The matrix multiplication is conducted in multiple steps, each step calculating the partial results of a submatrix of size 16x16 (the block size). The number of steps is the columns of $\mathrm{A}$ divided by the block size. + - For improved performance, in each step the threads first load the corresponding submatrices from both $\mathrm{A}$ and $\mathrm{B}$ to the shared memory. Thereby each thread has to perform only one global memory fetch instead of loading the full 16-item row from each submatrix. + - Between loading and using values to/from shared memory, a call to `__syncthreads` has to be invoked. This is to ensure that all threads have finished writing to the shared memory before other threads might use the same memory locations. + - The reason behind this is that it is not guaranteed that all threads in the block execute concurrently. Indeed, the compute unit schedules the threads to execute in so-called "wavefronts". While one wavefront is waiting for memory operations to complete, another one might get scheduled to execute. The call to `__syncthreads` ensures that all threads in the block finish the pending memory operations and the loaded memory can safely be used from any other thread. ## Used API surface + ### HIP runtime + #### Device symbols + - `threadIdx`, `blockIdx`, `blockDim`, `gridDim` - `__shared__` - `__syncthreads` #### Host symbols + - `hipMalloc` - `hipMemcpy` - `hipGetLastError` diff --git a/HIP-Basic/module_api/README.md b/HIP-Basic/module_api/README.md index a55ce82c..7321e3b8 100644 --- a/HIP-Basic/module_api/README.md +++ b/HIP-Basic/module_api/README.md @@ -1,9 +1,11 @@ # HIP-Basic Module API Example ## Description + This example shows how to load and execute a HIP module at runtime without linking it to the rest of the code during compilation. ### Application flow + 1. Set up the name of the compiled module code object file (`*.co`), located in the same directory. 2. Define kernel launch parameters. 3. Initialize input and output vectors in host memory. @@ -18,42 +20,51 @@ This example shows how to load and execute a HIP module at runtime without linki 12. Compare input and output vectors. The result of the comparison is printed to standard output. 
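Before the build details below, the following minimal sketch shows the load-fetch-launch sequence end to end. The module file name `module.co` matches the application flow above; the kernel name `square_kernel`, the packed argument struct, and the variables `d_out`, `d_in`, `size`, `grid_size` and `block_size` are illustrative assumptions, and error handling is omitted:

```c++
#include <hip/hip_runtime.h>
#include <cstddef>

// Arguments packed in the exact memory layout and alignment the kernel expects
// (assumed kernel signature: square_kernel(float* out, const float* in, size_t size)).
struct
{
    float*       out;
    const float* in;
    std::size_t  size;
} args{d_out, d_in, size};

std::size_t args_size = sizeof(args);
void*       config[]  = {HIP_LAUNCH_PARAM_BUFFER_POINTER,
                         &args,
                         HIP_LAUNCH_PARAM_BUFFER_SIZE,
                         &args_size,
                         HIP_LAUNCH_PARAM_END};

hipModule_t   module;
hipFunction_t kernel;
hipModuleLoad(&module, "module.co");
hipModuleGetFunction(&kernel, module, "square_kernel");

// kernelParams is not implemented, so the arguments are passed via the extra parameter.
hipModuleLaunchKernel(kernel, grid_size, 1, 1, block_size, 1, 1, 0, hipStreamDefault, nullptr, config);

hipModuleUnload(module);
```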
## Building + The kernel module needs to be compiled as a non-linked device code object file (`*.co`), in one of the following ways: - - `hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]` - - `clang++ --cuda-device-only --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]` + +- `hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]` +- `clang++ --cuda-device-only --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]` where the parameters are: - - `[TARGET GPU]`: GPU architecture (e.g. `gfx908` or `gfx90a:xnack-`). + +- `[TARGET GPU]`: GPU architecture (e.g. `gfx908` or `gfx90a:xnack-`). +- `[INPUT FILE]`: Name of the file containing kernels (e.g. `module.hip`). +- `[OUTPUT FILE]`: Name of the generated code object file (e.g. `module.co`). The `main.hip` example file is compiled similarly to the other examples. ## Key APIs and Concepts + - The `hipModuleLoad(hipModule_t *module, const char *file_name)` loads a HIP module at runtime from the path that is given as an input parameter, or returns an error. - The `hipModuleGetFunction(hipFunction_t *kernel_function, hipModule_t module, const char *kernel_name)` fetches a reference to the `__global__` kernel function in the HIP module. - `hipModuleLaunchKernel` launches a kernel function on the device. The input parameters are: - - `hipFunction_t kernel_function` Kernel function. - - `unsigned int gridDimX`: Number of blocks in the dimension X. - - `unsigned int gridDimY`: Number of blocks in the dimension Y. - - `unsigned int gridDimZ`: Number of blocks in the dimension Z. - - `unsigned int blockDimX`: Number of threads in the dimension X in a block. - - `unsigned int blockDimY`: Number of threads in the dimension Y in a block. - - `unsigned int blockDimZ`: Number of threads in the dimension Z in a block. - - `unsigned int sharedMemBytes`: Amount of dynamic shared memory that will be available to each workgroup, in bytes. (Not used in this example.) - - `hipStream_t stream`: The device stream, on which the kernel should be dispatched. (`hipStreamDefault` int this example.) - - `void **kernelParams`: Pointer to the arguments needed by the kernel. Note that this parameter is not yet implemented, and thus the _extra_ parameter (the last one described in this list) should be used to pass arguments to the kernel. (Thereby `nullptr` is used in the example.) - - `void **extra`: Pointer to all extra arguments passed to the kernel. They must be in the memory layout and alignment expected by the kernel. The list of arguments must end with `HIP_LAUNCH_PARAM_END`. + + - `hipFunction_t kernel_function`: Kernel function. + - `unsigned int gridDimX`: Number of blocks in the dimension X. + - `unsigned int gridDimY`: Number of blocks in the dimension Y. + - `unsigned int gridDimZ`: Number of blocks in the dimension Z. + - `unsigned int blockDimX`: Number of threads in the dimension X in a block. + - `unsigned int blockDimY`: Number of threads in the dimension Y in a block. + - `unsigned int blockDimZ`: Number of threads in the dimension Z in a block. + - `unsigned int sharedMemBytes`: Amount of dynamic shared memory that will be available to each workgroup, in bytes. (Not used in this example.) + - `hipStream_t stream`: The device stream, on which the kernel should be dispatched. (`hipStreamDefault` in this example.) 
+ - `void **kernelParams`: Pointer to the arguments needed by the kernel. Note that this parameter is not yet implemented, and thus the _extra_ parameter (the last one described in this list) should be used to pass arguments to the kernel. (Therefore `nullptr` is used in the example.) + - `void **extra`: Pointer to all extra arguments passed to the kernel. They must be in the memory layout and alignment expected by the kernel. The list of arguments must end with `HIP_LAUNCH_PARAM_END`. ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `__global__` - `threadIdx` #### Host symbols + - `hipGetLastError` - `hipGetSymbolAddress` - `hipGetSymbolSize` diff --git a/HIP-Basic/moving_average/README.md b/HIP-Basic/moving_average/README.md index 9fc26087..969012b7 100644 --- a/HIP-Basic/moving_average/README.md +++ b/HIP-Basic/moving_average/README.md @@ -1,9 +1,11 @@ # HIP-Basic Moving Average Example ## Description + This example shows the use of a kernel that computes a moving average on one-dimensional data. In a sequential program, the moving average of a given input array is found by processing the elements one by one. The average of the previous $n$ elements is called the moving average, where $n$ is called the _window size_. In this example, a kernel is implemented to compute the moving average in parallel, using shared memory as a cache. ### Application flow + 1. Define constants to control the problem size and the kernel launch parameters. 2. Allocate and initialize the input array. This array is initialized as the sequentially increasing sequence $0, 1, 2, \ldots \mod n$. 3. Allocate the device array and copy the host array to it. @@ -11,11 +13,15 @@ This example shows the use of a kernel that computes a moving average on one-dim 5. Copy the result back to the host and validate it. As each average is computed using $n$ consecutive values from the input array, the average is computed over the values $0, 1, 2, \ldots, n - 1$, the average of which is equal to $(n-1)/2$. ## Key APIs and Concepts + Device memory is allocated with `hipMalloc`, deallocated with `hipFree`. Copies to and from the device are made with `hipMemcpy` with options `hipMemcpyHostToDevice` and `hipMemcpyDeviceToHost`, respectively. A kernel is launched with the `myKernel<<<...>>>()`-syntax. Shared memory is allocated in the kernel with the `__shared__` memory space specifier. ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `__shared__` - `__syncthreads` - `blockDim` @@ -23,6 +29,7 @@ Device memory is allocated with `hipMalloc`, deallocated with `hipFree`. Copies - `threadIdx` #### Host symbols + - `__global__` - `hipFree` - `hipGetLastError` diff --git a/HIP-Basic/multi_gpu_data_transfer/README.md b/HIP-Basic/multi_gpu_data_transfer/README.md index 3a650577..af5576cc 100644 --- a/HIP-Basic/multi_gpu_data_transfer/README.md +++ b/HIP-Basic/multi_gpu_data_transfer/README.md @@ -1,6 +1,7 @@ # HIP-Basic Multi GPU Data Transfer Example ## Description + Peer-to-peer (P2P) communication allows direct communication over PCIe (or NVLINK, in some NVIDIA configurations) between devices. Given that it is not necessary to access the host in order to transfer data between devices, P2P communications provide a lower latency than traditional communications that do need to access the host. Because P2P communication is done over PCIe/NVLINK, the availability of this type of communication among devices depends mostly on the existing PCIe/NVLINK topology. 
@@ -8,6 +9,7 @@ Because P2P communication is done over PCIe/NVLINK, the availability of this typ In this example, the result of a matrix transpose kernel execution on one device is directly copied to the other one, showcasing how to carry out a P2P communication between two GPUs. ### Application flow + 1. P2P communication support is checked among the available devices. If two devices supporting P2P communication between them are found, they are selected for the example. A trace message reports the IDs of the selected devices. 2. The input and output matrices are allocated and initialized in host memory. 3. The first device selected is set as the current device, device memory for the input and output matrices is allocated on the current device and the input data is copied from the host. @@ -21,6 +23,7 @@ In this example, the result of a matrix transpose kernel execution on one device 11. Results are validated and printed to the standard output. ## Key APIs and Concepts + - `hipGetDeviceCount` gives the number of devices available. In this example it is used to check whether more than one device is available. - `hipDeviceCanAccessPeer` queries whether a certain device can directly access the memory of a given peer device. A P2P communication is supported between two devices if this function returns true for those two devices. - `hipSetDevice` sets the specified device as the default device for the subsequent API calls. Such a device is then known as _current device_. @@ -34,16 +37,19 @@ In this example, the result of a matrix transpose kernel execution on one device ## Demonstrated API Calls ### HIP runtime + - `__global__` - `__shared__` #### Device symbols + - `blockDim` - `blockIdx` - `threadIdx` - `__syncthreads` #### Host symbols + - `hipDeviceCanAccessPeer` - `hipDeviceDisablePeerAccess` - `hipDeviceEnablePeerAccess` diff --git a/HIP-Basic/occupancy/README.md b/HIP-Basic/occupancy/README.md index a38be5b9..16325bf1 100644 --- a/HIP-Basic/occupancy/README.md +++ b/HIP-Basic/occupancy/README.md @@ -1,11 +1,13 @@ # HIP-Basic Occupancy Example ## Description + This example showcases how to find optimal configuration parameters for a kernel launch with maximum occupancy. It uses the HIP occupancy calculator APIs to find a kernel launch configuration that yields maximum occupancy. This configuration is used to launch a kernel, and the utilization difference is measured against another kernel launch that is manually (and suboptimally) configured. The application kernel is a simple vector-vector multiplication of the form `C[i] = A[i]*B[i]`, where `A`, `B` and `C` are vectors of size `size`. The example shows 100% occupancy for both manual and automatic configurations, because the simple kernel does not use many resources per thread or per block, especially `__shared__` memory. The execution time for the automatic launch is still lower because of a lower overhead associated with fewer blocks being executed. -### Application flow +### Application flow + 1. Host side data is instantiated in `std::vector`. 2. Device side storage is allocated using `hipMalloc` in `float*`. 3. Data is copied from host to device using `hipMemcpy`. @@ -17,13 +19,17 @@ The example shows 100% occupancy for both manual and automatic configurations, b 9. All device memory is freed using `hipFree`. ## Key APIs and Concepts + GPUs have a large amount of parallel resources available. Utilizing these resources in an optimal way is very important for achieving the best performance. 
The HIP occupancy calculator API `hipOccupancyMaxPotentialBlockSize` allows finding the kernel block size that launches the most threads per thread block for a given kernel. The `hipOccupancyMaxActiveBlocksPerMultiprocessor` calculates the maximum number of active blocks per GPU multiprocessor for a given block size and kernel. ### Occupancy + Occupancy is the ratio of active wavefronts (or warps) to the maximum number of wavefronts (or warps) that can be deployed on a GPU multiprocessor. HIP GPU threads execute on a GPU multiprocessor, which has limited resources such as registers and shared memory. These resources are shared among threads within a thread block. When the usage of these shared resources is minimized (by compiler optimization or user code design), more blocks can execute simultaneously per multiprocessor, thereby increasing the occupancy. ## Used API surface + ### HIP runtime + - `hipMalloc` - `hipMemcpy` - `hipEventCreate` diff --git a/HIP-Basic/opengl_interop/README.md b/HIP-Basic/opengl_interop/README.md index 42f1323c..80d2aee1 100644 --- a/HIP-Basic/opengl_interop/README.md +++ b/HIP-Basic/opengl_interop/README.md @@ -1,10 +1,13 @@ # HIP-Basic OpenGL Interop Example ## Description + External device resources and other handles can be shared with HIP in order to provide interoperability between different GPU APIs. This example showcases a HIP program that interacts with OpenGL: a simple HIP kernel is used to simulate a sine wave over a grid of points, in a buffer that is shared with OpenGL. The resulting data is then rendered to a window as a grid of triangles using OpenGL. ### Application flow + #### Initialization + 1. A window is opened using the GLFW library. 2. OpenGL is initialized: the window's context is made active, function pointers are loaded, debug output is enabled if possible. 3. A HIP device is picked that is OpenGL-interop capable with the current OpenGL context by using `hipGLGetDevices`. @@ -14,14 +17,18 @@ External device resources and other handles can be shared with HIP in order to p 7. OpenGL rendering state is bound. #### Rendering + 1. The sine wave simulation kernel is launched in order to update the OpenGL shared buffer. 2. The grid is drawn to the window's framebuffer. 3. The window's framebuffer is presented to the screen. ## Dependencies + This example has additional library dependencies besides HIP: + - [GLFW](https://glfw.org). There are three options for getting this dependency satisfied: 1. Install it through a package manager. Available for Linux, where GLFW can be installed from some of the usual package managers: + - APT: `apt-get install libglfw3-dev` - Pacman: `pacman -S glfw-x11` or `pacman -S glfw-wayland` - DNF: `dnf install glfw-devel` @@ -30,15 +37,21 @@ This example has additional library dependencies besides HIP: - APT: `apt-get install libxxf86vm-dev libxi-dev` - Pacman: `pacman -S libxi libxxf86vm` - DNF: `dnf install libXi-devel libXxf86vm-devel` + 2. Build from source. GLFW supports compilation on Windows with Visual C++ (2010 and later), MinGW and MinGW-w64 and on Linux and other Unix-like systems with GCC and Clang. Please refer to the [compile guide](https://www.glfw.org/docs/latest/compile.html) for a complete guide on how to do this. Note: it should not only be built as explained in the guide, but additionally the install target must be built (`cmake --build <build directory> --target install`). + 3. Get the pre-compiled binaries from its [download page](https://www.glfw.org/download). Available for Windows. 
- Depending on the build tool used, some extra steps may be needed: - - If using CMake, the `glfw3Config.cmake` and `glfw3Targets.cmake` files must be in a path that CMake searches by default or must be passed using `-DCMAKE_MODULE_PATH`. The official GLFW3 binaries do not ship these files on Windows, and so GLFW must either be compiled manually or obtained from [vcpkg](https://vcpkg.io/), which does ship the required cmake files. - - If the former approach is selected, CMake will be able to find GLFW on Windows if the environment variable `GLFW3_DIR` (or the cmake option `-DCMAKE_PREFIX_PATH`) is set to (contain) the folder owning `glfw3Config.cmake` and `glfw3Targets.cmake`. For instance, if GLFW was installed in `C:\Program Files(x86)\GLFW\`, this will most surely be something like `C:\Program Files (x86)\GLFW\lib\cmake\glfw3\`. - - If the latter, the vcpkg toolchain path should be passed to CMake using `-DCMAKE_TOOLCHAIN_FILE="/path/to/vcpkg/scripts/buildsystems/vcpkg.cmake"`. - - If using Visual Studio, the easiest way to obtain GLFW is by installing `glfw3` from vcpkg. Alternatively, the appropriate path to the GLFW3 library and header directories can be set in `Properties->Linker->General->Additional Library Directories` and `Properties->C/C++->General->Additional Include Directories`. When using this method, the appropriate name for the GLFW library should also be updated under `Properties->C/C++->Linker->Input->Additional Dependencies`. For instance, if the path to the root folder of the Windows binaries installation was `C:\glfw-3.3.8.bin.WIN64\` and we set `GLFW_DIR` with this path, the project configuration file (`.vcxproj`) should end up containing something similar to the following: - ``` + Depending on the build tool used, some extra steps may be needed: + + - If using CMake, the `glfw3Config.cmake` and `glfw3Targets.cmake` files must be in a path that CMake searches by default or must be passed using `-DCMAKE_MODULE_PATH`. The official GLFW3 binaries do not ship these files on Windows, and so GLFW must either be compiled manually or obtained from [vcpkg](https://vcpkg.io/), which does ship the required cmake files. + + - If the former approach is selected, CMake will be able to find GLFW on Windows if the environment variable `GLFW3_DIR` (or the cmake option `-DCMAKE_PREFIX_PATH`) is set to (contain) the folder containing `glfw3Config.cmake` and `glfw3Targets.cmake`. For instance, if GLFW was installed in `C:\Program Files(x86)\GLFW\`, this will most likely be something like `C:\Program Files (x86)\GLFW\lib\cmake\glfw3\`. + - If the latter, the vcpkg toolchain path should be passed to CMake using `-DCMAKE_TOOLCHAIN_FILE="/path/to/vcpkg/scripts/buildsystems/vcpkg.cmake"`. + + - If using Visual Studio, the easiest way to obtain GLFW is by installing `glfw3` from vcpkg. Alternatively, the appropriate path to the GLFW3 library and header directories can be set in `Properties->Linker->General->Additional Library Directories` and `Properties->C/C++->General->Additional Include Directories`. When using this method, the appropriate name for the GLFW library should also be updated under `Properties->C/C++->Linker->Input->Additional Dependencies`. For instance, if the path to the root folder of the Windows binaries installation was `C:\glfw-3.3.8.bin.WIN64\` and we set `GLFW_DIR` with this path, the project configuration file (`.vcxproj`) should end up containing something similar to the following: + + ```xml ... 
@@ -55,31 +68,44 @@ This example has additional library dependencies besides HIP: ``` ## Key APIs and Concepts + - `hipGLGetDevices(unsigned int* pHipDeviceCount, int* pHipDevices, unsigned int hipDeviceCount, hipGLDeviceList deviceList)` can be used to query which HIP devices can be used to share resources with the current OpenGL context. A device returned by this function must be selected using `hipSetDevice` or a stream must be created from such a device before OpenGL interop is possible. + - `hipGraphicsGLRegisterBuffer(hipGraphicsResource_t* resource, GLuint buffer, unsigned int flags)` is used to import an OpenGL buffer into HIP. `flags` affects how the resource is used in HIP. For example: -| flag | effect | -| -------------------------------------- | ----------------------------------------------- | -| `hipGraphicsRegisterFlagsNone` | HIP functions may read and write to the buffer. | -| `hipGraphicsRegisterFlagsReadOnly` | HIP functions may only read from the buffer. | -| `hiPGraphicsRegisterFlagsWriteDiscard` | HIP functions may only write to the buffer. | + + | flag | effect | + | -------------------------------------- | ----------------------------------------------- | + | `hipGraphicsRegisterFlagsNone` | HIP functions may read and write to the buffer. | + | `hipGraphicsRegisterFlagsReadOnly` | HIP functions may only read from the buffer. | + | `hipGraphicsRegisterFlagsWriteDiscard` | HIP functions may only write to the buffer. | + - `hipGraphicsMapResources(int count, hipGraphicsResource_t* resources, hipStream_t stream = 0)` is used to make imported OpenGL resources available to a HIP device, either the current device or a device used by a specific stream. + - `hipGraphicsResourceGetMappedPointer(void** pointer, size_t* size, hipGraphicsResource_t resource)` is used to query the device pointer that represents the memory backing the OpenGL resource. The resulting pointer may be used as any other device pointer, like those obtained from `hipMalloc`. + - `hipGraphicsUnmapResources(int count, hipGraphicsResource_t* resources, hipStream_t stream = 0)` is used to unmap imported resources from a HIP device or stream. + - `hipGraphicsUnregisterResource(hipGraphicsResource_t resource)` is used to unregister a previously imported OpenGL resource, so that it is no longer shared with HIP. ## Caveats + ### Multi-GPU systems + When using OpenGL-HIP interop on multi-GPU systems, the OpenGL context must be created with the device that should be used for rendering. This is not done in this example for brevity, but is required in specific scenarios. For example, consider a multi-GPU machine with an AMD and an NVIDIA GPU: when this example is compiled for the HIP runtime, it must be launched such that the AMD GPU is used to render. A simple workaround is to launch the program from the monitor that is physically connected to the GPU to use. For multi-GPU laptops running Linux with an integrated AMD or Intel GPU and a dedicated NVIDIA GPU, the example must be launched with `__GLX_VENDOR_LIBRARY_NAME=nvidia` when compiling for NVIDIA. 
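To make the call sequence concrete, a minimal sketch of the register-map-use-unmap cycle described above follows. The buffer handle `gl_buffer`, the kernel `sinewave_kernel` and its launch configuration are illustrative assumptions, and error handling is omitted:

```c++
// gl_buffer is an OpenGL buffer object created beforehand, e.g. with glBufferData.
hipGraphicsResource_t resource;
hipGraphicsGLRegisterBuffer(&resource, gl_buffer, hipGraphicsRegisterFlagsWriteDiscard);

// Make the buffer accessible to HIP and query the device pointer backing it.
hipGraphicsMapResources(1, &resource, hipStreamDefault);
float* device_ptr;
size_t size;
hipGraphicsResourceGetMappedPointer(reinterpret_cast<void**>(&device_ptr), &size, resource);

// The mapped pointer can be used like any other device pointer.
sinewave_kernel<<<grid_dim, block_dim, 0, hipStreamDefault>>>(device_ptr, time);

// Release the buffer back to OpenGL before it is used for rendering, and
// unregister it once it no longer needs to be shared with HIP.
hipGraphicsUnmapResources(1, &resource, hipStreamDefault);
hipGraphicsUnregisterResource(resource);
```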
## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `threadIdx` - `blockIdx` - `blockDim` - `__global__` #### Host symbols + - `hipGetDeviceProperties` - `hipGetLastError` - `hipGLDeviceListAll` diff --git a/HIP-Basic/runtime_compilation/README.md b/HIP-Basic/runtime_compilation/README.md index a1c7e39e..23467970 100644 --- a/HIP-Basic/runtime_compilation/README.md +++ b/HIP-Basic/runtime_compilation/README.md @@ -7,6 +7,7 @@ Runtime compilation allows compiling fragments of source code to machine code at This example showcases how to make use of hipRTC to compile a kernel at runtime and launch it on a device. This kernel is a simple SAXPY, i.e. a single-precision operation $y_i=ax_i+y_i$. ### Application flow + The diagram below summarizes the runtime compilation part of the example. 1. A number of variables are declared and defined to configure the program which will be compiled at runtime. 2. The program is created using the above variables as parameters, along with the SAXPY kernel in string form. @@ -27,33 +28,39 @@ The diagram below summarizes the runtime compilation part of the example. 17. The first few elements of the result vector $y$ are printed to the standard output. ![hiprtc.svg](hiprtc.svg) + ## Key APIs and Concepts + - `hipGetDeviceProperties` extracts the properties of the desired device. In this example it is used to get the GPU architecture. - `hipModuleGetFunction` extracts a handle for a function with a certain name from a given module. Note that if no function with that name is present in the module this method will return an error. - `hipModuleLaunchKernel` queues the launch of the provided kernel on the device. This function normally behaves asynchronously (see `HIP_LAUNCH_BLOCKING`), i.e. a call to it may return before the device finishes the execution of the kernel. Its parameters are the following: - - The kernel to be launched. - - Number of blocks in the dimension X of kernel grid, i.e. the X component of grid size. - - Number of blocks in the dimension Y of kernel grid, i.e. the Y component of grid size. - - Number of blocks in the dimension Z of kernel grid, i.e. the Z component of grid size. - - Number of threads in the dimension X of each block, i.e. the X component of block size. 
+ - Number of threads in the dimension Y of each block, i.e. the Y component of block size. + - Number of threads in the dimension Z of each block, i.e. the Z component of block size. + - Amount of dynamic shared memory that will be available to each workgroup, in bytes. Not used in this example. + - The device stream, on which the kernel should be dispatched. If 0 (or NULL), the NULL stream will be used. In this example the latter is used. + - Pointer to the arguments needed by the kernel. Note that this parameter is not yet implemented, and thus the _extra_ parameter (the last one described in this list) should be used to pass arguments to the kernel. + - Pointer to all extra arguments passed to the kernel. They must be in the memory layout and alignment expected by the kernel. The list of arguments must end with `HIP_LAUNCH_PARAM_END`. + - `hipModuleLoadData` builds a module from a code (compiled binary) object residing in host memory and loads it into the current context. Note that in this example this function is called right after `hipMalloc`. This is due to the fact that, on CUDA, `hipModuleLoadData` will fail if it is not called after some runtime API call is done (as it will implicitly initialize a current context) or if there is not an explicit creation of a (current) context. - `hipModuleUnload` unloads the specified module from the current context and frees it. - `hiprtcCompileProgram` compiles the given program at runtime. Some compilation options may be passed as parameters to this function. In this example, the GPU architecture is the only compilation option. - `hiprtcCreateProgram` instantiates a runtime compilation program from the given parameters. Those are the following: - - The runtime compilation program object that will be set with the new instance. - - A pointer to the program source code. - - A pointer to the program name. - - The number of headers to be included. - - An array of pointers to the headers names. - - An array of pointers to the names to be included in the source program. + + - The runtime compilation program object that will be set with the new instance. + - A pointer to the program source code. + - A pointer to the program name. + - The number of headers to be included. + - An array of pointers to the header names. + - An array of pointers to the names to be included in the source program. In this example the program is created including two header files to illustrate how to pass all of the above arguments to this function. + - `hiprtcDestroyProgram` destroys an instance of a given runtime compilation program object. - `hiprtcGetProgramLog` extracts the char pointer to the log generated during the compilation of a given runtime compilation program. - `hiprtcGetProgramLogSize` returns the compilation log size of a given runtime compilation program, measured as number of characters. @@ -65,9 +72,11 @@ The diagram below summarizes the runtime compilation part of the example. ### HIP runtime #### Device symbols + - `threadIdx`, `blockIdx`, `blockDim` #### Host symbols + - `hipFree` - `hipGetDeviceProperties` - `hipGetLastError` diff --git a/HIP-Basic/saxpy/README.md b/HIP-Basic/saxpy/README.md index 7cf124ee..f46e151e 100644 --- a/HIP-Basic/saxpy/README.md +++ b/HIP-Basic/saxpy/README.md @@ -1,9 +1,11 @@ # HIP-Basic "SAXPY" Example ## Description + This program demonstrates a simple implementation of the "SAXPY" kernel. The "S" stands for single-precision (i.e. `float`) and "AXPY" stands for the operation performed: $Y_i=aX_i+Y_i$. 
The simple nature of this example makes it an ideal starting point for developers who are just getting introduced to HIP. -### Application flow +### Application flow + 1. A number of constants are defined to control the problem details and the kernel launch parameters. 2. The two input vectors, $X$ and $Y$, are instantiated in host memory. $X$ is filled with an incrementing sequence starting from 1, whereas $Y$ is filled with ones. 3. The necessary amount of device (GPU) memory is allocated and the elements of the input vectors are copied to the device memory. @@ -14,24 +16,29 @@ This program demonstrates a simple implementation of the "SAXPY" kernel. The "S" 8. The first few elements of the result vector are printed to the standard output. ## Key APIs and Concepts + - `hipMalloc` is used to allocate memory in the global memory of the device (GPU). This is usually necessary, since the kernels running on the device cannot access host (CPU) memory (unless it is device-accessible pinned host memory, see `hipHostMalloc`). Beware that the memory returned is uninitialized. - `hipFree` de-allocates device memory allocated by `hipMalloc`. It is necessary to free no longer used memory with this function to avoid resource leakage. - `hipMemcpy` is used to transfer bytes between the host and the device memory in both directions. A call to it synchronizes the device with the host, meaning that all kernels queued before `hipMemcpy` will finish before the copying starts. The function returns once the copying has finished. - `myKernelName<<<gridDim, blockDim, dynamicShared, stream>>>(kernelArguments)` queues the execution of the provided kernel on the device. It is asynchronous; the call may return before the execution of the kernel is finished. Its arguments are the following: - - The kernel (`__global__`) function to launch. - - The number of blocks in the kernel grid, i.e. the grid size. It can be up to 3 dimensions. - - The number of threads in each block, i.e. the block size. It can be up to 3 dimensions. - - The amount of dynamic shared memory provided for the kernel, in bytes. Not used in this example. - - The device stream, on which the kernel is queued. In this example, the default stream is used. - - All further arguments are passed to the kernel function. Notice, that built-in and simple (POD) types may be passed to the kernel, but complex ones (e.g. `std::vector`) usually cannot be. + - The kernel (`__global__`) function to launch. + - The number of blocks in the kernel grid, i.e. the grid size. It can be up to 3 dimensions. + - The number of threads in each block, i.e. the block size. It can be up to 3 dimensions. + - The amount of dynamic shared memory provided for the kernel, in bytes. Not used in this example. + - The device stream, on which the kernel is queued. In this example, the default stream is used. + - All further arguments are passed to the kernel function. Notice that built-in and simple (POD) types may be passed to the kernel, but complex ones (e.g. `std::vector`) usually cannot be. - `hipGetLastError` returns the error code resulting from the previous operation. 
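The following self-contained sketch illustrates this pattern; the kernel body mirrors the $Y_i=aX_i+Y_i$ operation described above, while the names, the problem size and the omission of error-code checking are simplifications rather than the example's exact source:

```c++
#include <hip/hip_runtime.h>
#include <vector>

__global__ void saxpy_kernel(const float a, const float* x, float* y, const unsigned int size)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < size)
    {
        y[i] = a * x[i] + y[i];
    }
}

int main()
{
    constexpr unsigned int size       = 1 << 20;
    constexpr unsigned int block_size = 256;
    const unsigned int     grid_size  = (size + block_size - 1) / block_size;

    std::vector<float> x(size, 1.F), y(size, 2.F);
    float *d_x, *d_y;
    hipMalloc(&d_x, size * sizeof(float));
    hipMalloc(&d_y, size * sizeof(float));
    hipMemcpy(d_x, x.data(), size * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, y.data(), size * sizeof(float), hipMemcpyHostToDevice);

    // Queue the kernel on the default stream, one thread per element.
    saxpy_kernel<<<grid_size, block_size, 0, hipStreamDefault>>>(2.F, d_x, d_y, size);
    hipGetLastError(); // check that the kernel launch succeeded

    // Copy the result back; hipMemcpy also synchronizes with the kernel.
    hipMemcpy(y.data(), d_y, size * sizeof(float), hipMemcpyDeviceToHost);

    hipFree(d_x);
    hipFree(d_y);
}
```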
## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `threadIdx`, `blockIdx`, `blockDim` #### Host symbols + - `hipMalloc` - `hipFree` - `hipMemcpy` diff --git a/HIP-Basic/shared_memory/README.md b/HIP-Basic/shared_memory/README.md index 7e389b96..291c78c0 100644 --- a/HIP-Basic/shared_memory/README.md +++ b/HIP-Basic/shared_memory/README.md @@ -1,24 +1,27 @@ # HIP-Basic Shared Memory Example ## Description -The shared memory is an on-chip type of memory that is visible to all the threads within the same block, allowing them to communicate by writing and reading data from the same memory space. However, some synchronization among the threads of the block is needed to ensure that all of them have written before trying to access the data. -When using the appropriate access pattern, this memory can provide much less latency than local or global memory (nearly as much as registers), making it a much better option in certain cases. If the size of the shared memory to be used is known at compile time, it can be explicitly specified and it is then known as static shared memory. +Shared memory is an on-chip type of memory that is visible to all the threads within the same block, allowing them to communicate by writing and reading data from the same memory space. However, some synchronization among the threads of the block is needed to ensure that all of them have written before trying to access the data. + +When using the appropriate access pattern, this memory can provide much lower latency than local or global memory (nearly as low as registers), making it a much better option in certain cases. If the size of the shared memory to be used is known at compile time, it can be explicitly specified and it is then known as static shared memory. This example implements a simple matrix transpose kernel to showcase how to use static shared memory. -### Application flow +### Application flow + 1. A number of constants are defined for the kernel launch parameters. 2. The input and output matrices are allocated and initialized in host memory. 3. The necessary amount of device memory for the input and output matrices is allocated and the input data is copied to the device. 4. A trace message is printed to the standard output. -5. The GPU kernel is then launched with the previously defined arguments. +5. The GPU kernel is then launched with the previously defined arguments. 6. The transposed matrix is copied back to host memory. 7. All device memory is freed. 8. The expected transposed matrix is calculated with a CPU version of the transpose kernel and the transposed matrix obtained from the kernel execution is then compared with it. The result of the comparison is printed to the standard output. ## Key APIs and Concepts -- `__shared__` is a variable declaration specifier necessary to allocate shared memory from the device. + +- `__shared__` is a variable declaration specifier necessary to allocate shared memory from the device. - `__syncthreads` allows synchronizing all the threads within the same block. This synchronization barrier is used to ensure that every thread in a block has finished writing to shared memory before other threads in the block try to access that data. - `hipMalloc` allocates device memory in global memory, and with `hipMemcpy` data bytes can be transferred from host to device (using `hipMemcpyHostToDevice`) or from device to host (using `hipMemcpyDeviceToHost`), among others. - `myKernelName<<<...>>>` queues the execution of a kernel on a device (GPU). 
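As a rough orientation, a transpose kernel using statically sized shared memory might look as follows. This is a minimal sketch assuming a square matrix whose side is a multiple of the tile size; the tile size and the names are illustrative, not the example's exact source:

```c++
constexpr unsigned int TileSize = 16;

__global__ void transpose_kernel(float* out, const float* in, const unsigned int width)
{
    // Static shared memory: the size is known at compile time.
    __shared__ float tile[TileSize][TileSize];

    const unsigned int x = blockIdx.x * TileSize + threadIdx.x;
    const unsigned int y = blockIdx.y * TileSize + threadIdx.y;

    // Each thread stages one element of the input tile in shared memory.
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    // Wait until the whole tile has been written before any thread reads it.
    __syncthreads();

    // Write the transposed tile: the block coordinates are swapped and the
    // tile is read transposed, keeping both reads and writes coalesced.
    const unsigned int out_x = blockIdx.y * TileSize + threadIdx.x;
    const unsigned int out_y = blockIdx.x * TileSize + threadIdx.y;
    out[out_y * width + out_x] = tile[threadIdx.x][threadIdx.y];
}
```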
@@ -28,16 +31,19 @@ This example implements a simple matrix transpose kernel to showcase how to use ## Demonstrated API Calls ### HIP runtime + - `__global__` - `__shared__` #### Device symbols + - `blockDim` - `blockIdx` - `threadIdx` - `__syncthreads` #### Host symbols + - `hipFree` - `hipGetLastError` - `hipMalloc` diff --git a/HIP-Basic/static_device_library/README.md b/HIP-Basic/static_device_library/README.md index 1b1ef2fe..31dd5979 100644 --- a/HIP-Basic/static_device_library/README.md +++ b/HIP-Basic/static_device_library/README.md @@ -1,9 +1,11 @@ # HIP-Basic Device Static Library Example ## Description + This example shows how to create a static library that exports device functions. ### Application flow + 1. A number of constants for the example problem are initialized. 2. A host vector is prepared with an increasing sequence of integers starting from 0. 3. The necessary amount of device (GPU) memory is allocated and the elements of the input vectors are copied to the device memory. @@ -15,30 +17,43 @@ This example shows how to create a static library that exports device functions. 9. The results from the device are compared with the expected results on the host. An error message is printed if the results were not as expected and the function returns with an error code. ## Build Process + Compiling a HIP static library that exports device functions must be done in two steps: + 1. First, the source files that make up the library must be compiled to object files. This is done similarly to how an object file is created for a regular source file (using the `-c` flag), except that the additional option `-fgpu-rdc` must be passed: + ```shell hipcc -c -fgpu-rdc -Ilibrary library/library.hip -o library.o ``` + 2. After compiling all library sources into object files, they must be manually bundled into an archive that can act as a static library. `hipcc` cannot currently create this archive automatically, hence it must be created manually using `ar`: + ```shell ar rcsD liblibrary.a library.o ``` + After the static device library has been compiled, it can be linked with another HIP program or library. Linking with a static device library is done by placing it on the command line directly, and additionally requires `-fgpu-rdc`. The static library should be placed on the command line _before_ any source files. Source files that use the static library can also be compiled to object files first; in this case, they also need to be compiled with `-fgpu-rdc`: + ```shell hipcc -fgpu-rdc liblibrary.a main.hip -o hip_static_device_library ``` + **Note**: static device libraries _must_ be linked with `hipcc`. There is no support yet for linking such libraries with (ROCm-bundled) clang, using CMake, or using Visual Studio. ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `blockDim` - `blockIdx` - `threadIdx` - `__device__` - `__global__` + #### Host symbols + - `hipMalloc` - `hipMemcpy` - `hipGetLastError` diff --git a/HIP-Basic/static_host_library/README.md b/HIP-Basic/static_host_library/README.md index 9d90df51..d364020e 100644 --- a/HIP-Basic/static_host_library/README.md +++ b/HIP-Basic/static_host_library/README.md @@ -1,9 +1,11 @@ # HIP-Basic Host Static Library Example ## Description + This example shows how to create a static library that exports host functions. The library may contain both `__global__` and `__device__` code as well, but in this example only `__host__` functions are exported. 
The resulting library may be linked with other libraries or programs, which do not necessarily need to be HIP libraries or programs. A static host library appears as a regular library, and is compatible with either hipcc or the native system's linker. When using the system linker, the libraries or applications using the static host library do need to be linked with `libamdhip64`. ### Application flow + 1. The `main` function in `main.cpp` calls the library's sole exported function, `run_test`. This symbol is made visible by including the static library's header file. 2. In `run_test` in `library/library.hip`, a number of constants for the example problem are initialized. 3. A vector with input data is initialized in host memory. It is filled with an incrementing sequence starting from 0. @@ -15,49 +17,67 @@ This example shows how to create a static library that exports host functions. 9. Control flow returns to `main` in `main.cpp`, which exits the program with the value that was returned from `run_test`. ## Build Process + A HIP static host library is built the same as a regular application, except that the additional flag `--emit-static-lib` must be passed to `hipcc`. Additionally, the library should be compiled with position independent code enabled: + ```shell hipcc library/library.hip -o liblibrary.a --emit-static-lib -fPIC ``` + Linking the static library with another library or object is done in the same way as a regular library: + ```shell hipcc -llibrary -Ilibrary main.cpp -o hip_static_host_library ``` + Note that when linking the library using the host compiler or linker, such as `g++` or `clang++`, the `amdhip64` library should additionally be linked: + ```shell g++ -L/opt/rocm/lib -llibrary -lamdhip64 -Ilibrary main.cpp -o hip_static_host_library ``` ### CMake + Building a HIP static host library can be done using the CMake `add_library` command: + ```cmake add_library(library_name STATIC library/library.hip) target_include_directories(library_name PUBLIC library) ``` + Note that while the required compilation flags to create a library are passed to the compiler automatically by CMake, position independent code must be turned on manually: + ```cmake set_target_properties(library_name PROPERTIES POSITION_INDEPENDENT_CODE ON) ``` + Linking with the static library is done in the same way as regular libraries. If used via `target_link_libraries`, this automatically adds the `amdhip64` dependency: + ```cmake add_executable(executable_name main.cpp) target_link_libraries(executable_name library_name) ``` ### Visual Studio 2019 + When using Visual Studio 2019 to build a HIP static host library, a separate project can be used to build the static library. This can be set up from scratch by creating a new AMD HIP C++ project, and then converting it to a library by setting `[right click project] -> Properties -> Configuration Properties -> General -> Configuration Type` to `Library`. Linking with a HIP static host library can then be done simply by adding a reference to the corresponding project. This can be done under `[right click project] -> Add -> Reference` by checking the checkbox of the library project, and works both for AMD HIP C++ Visual Studio projects (demonstrated in [static_host_library_vs2019.vcxproj](./static_host_library_vs2019.vcxproj)) and for regular Windows application Visual Studio projects (demonstrated in [static_host_library_msvc_vs2019.vcxproj](./static_host_library_msvc/static_host_library_msvc_vs2019.vcxproj)).
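For orientation, the following sketch shows what the exported interface and its consumer can look like. The header guard and the exact signature of `run_test` are assumptions based on the description above, not code copied from the example.

```cpp
// library.h -- the only interface a consumer of the static host library
// sees; it contains no HIP-specific types.
#ifndef LIBRARY_H
#define LIBRARY_H

// Runs the example's GPU workload and returns 0 on success,
// a non-zero error code otherwise.
int run_test();

#endif // LIBRARY_H

// main.cpp -- plain C++ that never includes a HIP header, which is why it
// can be built with g++ or clang++ as long as amdhip64 is linked.
#include "library.h"

int main()
{
    return run_test();
}
```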
## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `blockDim` - `blockIdx` - `threadIdx` - `__device__` - `__global__` + #### Host symbols + - `hipMalloc` - `hipMemcpy` - `hipGetLastError` diff --git a/HIP-Basic/streams/README.md b/HIP-Basic/streams/README.md index 03a50010..950675c3 100644 --- a/HIP-Basic/streams/README.md +++ b/HIP-Basic/streams/README.md @@ -1,9 +1,11 @@ # HIP-Basic Streams Example ## Description + A stream encapsulates a queue of tasks that are launched on the GPU device. This example showcases usage of multiple streams, each with their own tasks. These tasks include asynchronous memory copies using `hipMemcpyAsync` and asynchronous kernel launches using `myKernelName<<<...>>>`. ### Application flow + 1. Host side input and output memory is allocated using `hipHostMalloc` as pinned memory. This ensures that the memory copies are performed asynchronously when using `hipMemcpyAsync`. 2. Host input is instantiated. 3. Device side storage is allocated using `hipMalloc`. @@ -17,10 +19,13 @@ A stream encapsulates a queue of tasks that are launched on the GPU device. This 11. Free host side pinned memory using `hipHostFree`. ## Key APIs and Concepts + A HIP stream allows device tasks to be grouped and launched asynchronously and independently of other tasks, which can be used to hide latencies and increase task completion throughput. When results of a task queued on a particular stream are needed, it can be explicitly synchronized without blocking work queued on other streams. Each HIP stream is tied to a particular device, which enables HIP streams to be used to schedule work across multiple devices simultaneously. ## Demonstrated API Calls + ### HIP runtime + - `__shared__` - `__syncthreads` - `hipStream_t` diff --git a/HIP-Basic/texture_management/README.md b/HIP-Basic/texture_management/README.md index 8df3e5de..10517927 100644 --- a/HIP-Basic/texture_management/README.md +++ b/HIP-Basic/texture_management/README.md @@ -1,9 +1,11 @@ # HIP-Basic Texture Management Example ## Description + This example demonstrates how a kernel may use texture memory through the texture object API. Using texture memory may be beneficial as the texture cache is optimized for 2D spatial locality and exposes features such as hardware filtering. In the example, a texture is created using a device array and is sampled in a kernel to create a histogram of its values. -### Application flow +### Application flow + 1. Check whether texture functions are supported on the device. 2. Initialize the texture data on the host side. 3. Specify the channel description of the texture and allocate a device array based on the texture dimensions and channel descriptor. @@ -15,16 +17,20 @@ This example demonstrates how a kernel may use texture memory through the textur 9. Destroy the texture object and release resources. ## Key APIs and Concepts + - The memory for the texture may be a device array `hipArray_t`, which is allocated with `hipMallocArray`. The allocation call requires a channel descriptor `hipChannelFormatDesc` and the dimensions of the texture. The channel descriptor can be created using `hipCreateChannelDesc`. Host data can be transferred to the device array using `hipMemcpy2DToArray`. - The texture object `hipTextureObject_t` is created with `hipCreateTextureObject`, which requires a resource descriptor `hipResourceDesc` and a texture descriptor `hipTextureDesc`.
The resource descriptor describes the resource used to create the texture, in this example a device array `hipResourceTypeArray`. The texture descriptor describes the properties of the texture, such as its addressing mode and whether it uses normalized coordinates. - The created texture object can be sampled in a kernel using `tex2D`. - The texture object is cleaned up by calling `hipDestroyTextureObject` and the device array is cleaned up by calling `hipFreeArray`. ## Demonstrated API Calls + ### HIP runtime + - `__global__` #### Device symbols + - `atomicAdd` - `blockDim` - `blockIdx` @@ -32,6 +38,7 @@ This example demonstrates how a kernel may use texture memory through the textur - `threadIdx` #### Host symbols + - `hipArray_t` - `hipAddressModeWrap` - `hipChannelFormatDesc` diff --git a/HIP-Basic/vulkan_interop/README.md b/HIP-Basic/vulkan_interop/README.md index d177dfa6..094d74fd 100644 --- a/HIP-Basic/vulkan_interop/README.md +++ b/HIP-Basic/vulkan_interop/README.md @@ -1,10 +1,13 @@ # HIP-Basic Vulkan Interop Example ## Description + External device resources and other handles can be shared with HIP in order to provide interoperability between different GPU APIs. This example showcases a HIP program that interacts with the Vulkan API: A HIP kernel is used to simulate a sine wave over a grid of points, in a buffer that is shared with Vulkan. The resulting data is then rendered to a window using the Vulkan API. A set of shared semaphores is used to guarantee synchronous access to the device memory shared between HIP and Vulkan. ### Application flow + #### Initialization + 1. A window is opened using the GLFW library. 2. The Vulkan API is initialized: function pointers are loaded and the Vulkan instance is created. 3. A physical device is picked to execute the example kernel on and to render the result to the window. This physical device must be the same for HIP and for Vulkan in order to be able to share the required resources. This is done by comparing the devices' UUIDs, which can be queried from a HIP device with `hipDeviceGetUuid` and from a Vulkan physical device by passing `VkPhysicalDeviceIDProperties` to `vkGetPhysicalDeviceProperties2`. If the UUIDs from a particular HIP device and Vulkan device are the same, they represent the same physical or virtual device. @@ -20,7 +23,9 @@ External device resources and other handles can be shared with HIP in order to p 13. The Vulkan semaphores are converted to HIP external semaphores. This is done by first exporting a Vulkan semaphore handle to a native semaphore handle, either by `vkGetSemaphoreFdKHR` or `vkGetSemaphoreWin32HandleKHR` depending on the target platform. The resulting handle is passed to `hipImportExternalSemaphore` to obtain the HIP semaphore handle. #### Rendering + A frame is rendered as follows: + 1. The frame resources for the current frame in the frame pipeline are fetched from memory. 2. The next image index is acquired from the swapchain. 3. The command pool associated with the current frame is reset and the associated command buffer is initialized. @@ -32,6 +37,7 @@ A frame is rendered as follows: 9. The swapchain is asked to present the current frame to the screen. ## Key APIs and Concepts + To share memory allocated by Vulkan with HIP, the `VkDeviceMemory` must be created by passing the `VkExportMemoryAllocateInfoKHR` structure to `vkAllocateDeviceMemory`.
This structure needs the appropriate `handleTypes` set to a type that can be shared with HIP for the current platform; `VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT_KHR` for Linux and `VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT_KHR` or `VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT_KHR` for Windows. Any Vulkan buffer that is to be associated with this device memory must similarly be created by passing `VkExternalMemoryBufferCreateInfoKHR` to `vkCreateBuffer`, of which the `handleTypes` member must be initialized to the same value. The `VkDeviceMemory` handle can then be exported to a native file descriptor or `HANDLE` using `vkGetMemoryFdKHR` or `vkGetMemoryWin32HandleKHR` on Linux and Windows, respectively. A `hipExternalMemory_t` can then be imported from a native handle through `hipImportExternalMemory`. This function must be passed an instance of `hipExternalMemoryHandleDesc`, of which `type` is initialized with a handle type compatible with the Vulkan `handleTypes`. This mapping is as follows: | Vulkan memory handle type | HIP memory handle type | | --------------------------------------------------------- | ------------------------------------------- | @@ -53,68 +59,86 @@ To wait on a shared semaphore in HIP, `hipWaitExternalSemaphoresAsync` should be To signal a shared semaphore in HIP, the `hipSignalExternalSemaphoresAsync` function can be used. This must be passed a number of `hipExternalSemaphoreSignalParams` structures, each corresponding to a semaphore with the same index. When using timeline semaphores, the `fence.value` member should be set to specify the value to which the semaphore should be set. ## Dependencies + This example has additional library dependencies besides HIP: + - [GLFW](https://glfw.org). There are three options for getting this dependency satisfied: - 1. Install it through a package manager. Available for Linux, where GLFW can be installed from some of the usual package managers: - - APT: `apt-get install libglfw3-dev` - - Pacman: `pacman -S glfw-x11` or `pacman -S glfw-wayland` - - DNF: `dnf install glfw-devel` - - It could also happen that the `Xxf68vm` and `Xi` libraries required when linking against Vulkan are not installed. They can be found as well on the previous package managers: - - APT: `apt-get install libxxf86vm-dev libxi-dev` - - Pacman: `pacman -S libxi libxxf86vm` - - DNF: `dnf install libXi-devel libXxf86vm-devel` - 2. Build from source. GLFW supports compilation on Windows with Visual C++ (2010 and later), MinGW and MinGW-w64 and on Linux and other Unix-like systems with GCC and Clang. Please refer to the [compile guide](https://www.glfw.org/docs/latest/compile.html) for a complete guide on how to do this. Note: not only it should be built as explained in the guide, but it is additionally needed to build with the install target (`cmake --build --target install`). - 3. Get the pre-compiled binaries from its [download page](https://www.glfw.org/download). Available for Windows. - - Depending on the build tool used, some extra steps may be needed: - - If using CMake, the `glfw3Config.cmake` and `glfw3Targets.cmake` files must be in a path that CMake searches by default or must be passed using `-DCMAKE_MODULE_PATH`. The official GLFW3 binaries do not ship these files on Windows, and so GLFW must either be compiled manually or obtained from [vcpkg](https://vcpkg.io/), which does ship the required cmake files.
- - If the former approach is selected, CMake will be able to find GLFW on Windows if the environment variable `GLFW3_DIR` (or the cmake option `-DCMAKE_PREFIX_PATH`) is set to (contain) the folder owning `glfw3Config.cmake` and `glfw3Targets.cmake`. For instance, if GLFW was installed in `C:\Program Files(x86)\GLFW\`, this will most surely be something like `C:\Program Files (x86)\GLFW\lib\cmake\glfw3\`. - - If the latter, the vcpkg toolchain path should be passed to CMake using `-DCMAKE_TOOLCHAIN_FILE="/path/to/vcpkg/scripts/buildsystems/vcpkg.cmake"`. - - If using Visual Studio, the easiest way to obtain GLFW is by installing `glfw3` from vcpkg. Alternatively, the appropriate path to the GLFW3 library and header directories can be set in `Properties->Linker->General->Additional Library Directories` and `Properties->C/C++->General->Additional Include Directories`. When using this method, the appropriate name for the GLFW library should also be updated under `Properties->C/C++->Linker->Input->Additional Dependencies`. For instance, if the path to the root folder of the Windows binaries installation was `C:\glfw-3.3.8.bin.WIN64\` and we set `GLFW_DIR` with this path, the project configuration file (`.vcxproj`) should end up containing something similar to the following: - ``` - - - ... - $(GLFW_DIR)\include\;;%(AdditionalIncludeDirectories) - ... - - - ... - glfw3dll.lib;;%(AdditionalDependencies) - $(GLFW_DIR)\lib; - ... - - - ``` + + 1. Install it through a package manager. Available for Linux, where GLFW can be installed from some of the usual package managers: + - APT: `apt-get install libglfw3-dev` + - Pacman: `pacman -S glfw-x11` or `pacman -S glfw-wayland` + - DNF: `dnf install glfw-devel` + + It could also happen that the `Xxf86vm` and `Xi` libraries required when linking against Vulkan are not installed. They can also be installed from the previously mentioned package managers: + - APT: `apt-get install libxxf86vm-dev libxi-dev` + - Pacman: `pacman -S libxi libxxf86vm` + - DNF: `dnf install libXi-devel libXxf86vm-devel` + + 2. Build from source. GLFW supports compilation on Windows with Visual C++ (2010 and later), MinGW and MinGW-w64 and on Linux and other Unix-like systems with GCC and Clang. Please refer to the [compile guide](https://www.glfw.org/docs/latest/compile.html) for a complete guide on how to do this. Note: it should not only be built as explained in the guide, but the install target must additionally be built (`cmake --build --target install`). + + 3. Get the pre-compiled binaries from its [download page](https://www.glfw.org/download). Available for Windows. + + Depending on the build tool used, some extra steps may be needed: + + - If using CMake, the `glfw3Config.cmake` and `glfw3Targets.cmake` files must be in a path that CMake searches by default or must be passed using `-DCMAKE_MODULE_PATH`. The official GLFW3 binaries do not ship these files on Windows, and so GLFW must either be compiled manually or obtained from [vcpkg](https://vcpkg.io/), which does ship the required cmake files. + + - If the former approach is selected, CMake will be able to find GLFW on Windows if the environment variable `GLFW3_DIR` (or the cmake option `-DCMAKE_PREFIX_PATH`) is set to (contain) the folder containing `glfw3Config.cmake` and `glfw3Targets.cmake`. For instance, if GLFW was installed in `C:\Program Files(x86)\GLFW\`, this will most likely be something like `C:\Program Files (x86)\GLFW\lib\cmake\glfw3\`.
+ - If the latter, the vcpkg toolchain path should be passed to CMake using `-DCMAKE_TOOLCHAIN_FILE="/path/to/vcpkg/scripts/buildsystems/vcpkg.cmake"`. + + - If using Visual Studio, the easiest way to obtain GLFW is by installing `glfw3` from vcpkg. Alternatively, the appropriate path to the GLFW3 library and header directories can be set in `Properties->Linker->General->Additional Library Directories` and `Properties->C/C++->General->Additional Include Directories`. When using this method, the appropriate name for the GLFW library should also be updated under `Properties->C/C++->Linker->Input->Additional Dependencies`. For instance, if the path to the root folder of the Windows binaries installation was `C:\glfw-3.3.8.bin.WIN64\` and we set `GLFW_DIR` with this path, the project configuration file (`.vcxproj`) should end up containing something similar to the following: + + ```xml + <ItemDefinitionGroup> + <ClCompile> + ... + <AdditionalIncludeDirectories>$(GLFW_DIR)\include\;;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories> + ... + </ClCompile> + <Link> + ... + <AdditionalDependencies>glfw3dll.lib;;%(AdditionalDependencies)</AdditionalDependencies> + <AdditionalLibraryDirectories>$(GLFW_DIR)\lib;</AdditionalLibraryDirectories> + ... + </Link> + </ItemDefinitionGroup> + ``` + - Vulkan headers. On Linux, the Vulkan headers can be directly obtained from some package managers: - - Linux - - APT: `apt-get install -y libvulkan-dev` - - Pacman: `pacman -S vulkan-headers vulkan-icd-loader` - - DNF: `dnf install vulkan-headers vulkan-icd-loader` - But they may be as well obtained by installing the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). CMake will be able to find the SDK using the `VULKAN_SDK` environment variable, which is set by default using the SDK activation script. + - Linux + + - APT: `apt-get install -y libvulkan-dev` + - Pacman: `pacman -S vulkan-headers vulkan-icd-loader` + - DNF: `dnf install vulkan-headers vulkan-icd-loader` + + But they may also be obtained by installing the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). CMake will be able to find the SDK using the `VULKAN_SDK` environment variable, which is set by default using the SDK activation script. - On Windows, on the other hand, the headers can only be obtained from the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). Contrary to Unix-based OSs, the `VULKAN_SDK` environment variable is not automatically provided on Windows, and so it should be set to the appropriate path before invoking CMake. + On Windows, on the other hand, the headers can only be obtained from the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). Contrary to Unix-based OSs, the `VULKAN_SDK` environment variable is not automatically provided on Windows, and so it should be set to the appropriate path before invoking CMake. - Note that `libvulkan` is _not_ required, as the example loads function pointers dynamically. + Note that `libvulkan` is _not_ required, as the example loads function pointers dynamically. - Validation layers. The `VK_LAYER_KHRONOS_validation` layer is active by default to perform general checks on Vulkan, thus the [Khronos' Vulkan Validation Layers (VVL)](https://github.com/KhronosGroup/Vulkan-ValidationLayers/tree/main#vulkan-validation-layers-vvl) will need to be installed on the system if such checks are desirable. It can be either installed from a package manager (on Linux), built and configured from source or installed as part of the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). -Package managers offering the validation layers package include: - - APT: `apt install vulkan-validationlayers-dev` - - Pacman: `pacman -S vulkan-validation-layers`.
Note that with pacman both the validation layers and headers (among others) can be also installed with `pacman -S vulkan-devel`. - - DNF: `dnf install vulkan-validation-layers` - For the second approach, build instructions are provided on [Khronos Vulkan-ValidationLayers repository](https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/BUILD.md) and Vulkan's [Layers Overwiew and Configuration](https://vulkan.lunarg.com/doc/view/latest/windows/layer_configuration.html) document offers several approaches for its configuration. + Package managers offering the validation layers package include: + + - APT: `apt install vulkan-validationlayers-dev` + - Pacman: `pacman -S vulkan-validation-layers`. Note that with pacman both the validation layers and headers (among others) can also be installed with `pacman -S vulkan-devel`. + - DNF: `dnf install vulkan-validation-layers` + + For the second approach, build instructions are provided in the [Khronos Vulkan-ValidationLayers repository](https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/BUILD.md), and Vulkan's [Layers Overview and Configuration](https://vulkan.lunarg.com/doc/view/latest/windows/layer_configuration.html) document offers several approaches to its configuration. - `glslangValidator`. It is used in the example as a shader validation tool. It may be installed via package manager (`sudo apt install glslang-tools`), by [building manually from source](https://github.com/KhronosGroup/glslang#building-cmake), by downloading the binaries for the corresponding platform directly from the [main-tot](https://github.com/KhronosGroup/glslang/releases/tag/main-tot) release on GitHub or installed as part of the [LunarG Vulkan SDK](https://vulkan.lunarg.com/). ## Demonstrated API Calls + ### HIP runtime + #### Device symbols + - `threadIdx`, `blockIdx`, `blockDim` #### Host symbols + - `hipComputeModeProhibited` - `hipCUDAErrorTohipError` - `hipDestroyExternalMemory` diff --git a/HIP-Basic/warp_shuffle/README.md b/HIP-Basic/warp_shuffle/README.md index a8bfa260..2c36d5ce 100644 --- a/HIP-Basic/warp_shuffle/README.md +++ b/HIP-Basic/warp_shuffle/README.md @@ -1,6 +1,7 @@ # HIP-Basic Warp Shuffle Example ## Description + Kernel code for a particular block is executed in groups of threads known as _wavefronts_ (AMD) or _warps_ (NVIDIA). Each block is divided into as many warps as the block's size allows. If the block size is less than the warp size, then part of the warp just stays idle (as happens in this example). AMD GPUs use 64 threads per wavefront for architectures prior to RDNA™ 1. RDNA architectures support both 32 and 64 wavefront sizes. Warps are executed in _lockstep_, i.e. all the threads in each warp execute the same instruction at the same time but with different data. This type of parallel processing is also known as Single Instruction, Multiple Data (SIMD). A block contains several warps and the warp size is dependent on the architecture, but the block size is not. Blocks and warps also differ in the way they are executed, and thus they may provide different results when used in the same piece of code. For instance, the kernel code of this example would not work as-is with block execution and shared memory access e.g. because some synchronization would be needed to ensure that every thread has written its corresponding value before trying to access it.
@@ -10,6 +11,7 @@ Higher performance in the execution of kernels can be achieved with explicit war This example showcases how to use the above-mentioned operations by implementing a simple matrix transpose kernel. ### Application flow + 1. A number of constants are defined for the kernel launch parameters. 2. The input and output matrices are allocated and initialized in host memory. 3. The necessary amount of device memory for the input and output matrices is allocated and the input data is copied to the device. @@ -20,6 +22,7 @@ This example showcases how to use the above-mentioned operations by implementing 8. The expected transposed matrix is calculated with a CPU version of the transpose kernel and the transposed matrix obtained from the kernel execution is then compared with it. The result of the comparison is printed to the standard output. ## Key APIs and Concepts + Warp shuffle is a warp-level primitive that allows for the communication between the threads of a warp. Below is a simple example that shows how the value of the thread with index 2 is copied to all other threads within the warp. ![warp_shuffle_simple.svg](warp_shuffle_simple.svg) @@ -38,11 +41,13 @@ Warp shuffle is a warp-level primitive that allows for the communication between ### HIP runtime #### Device symbols + - `__global__` - `threadIdx` - `__shfl` #### Host symbols + - `hipFree` - `hipGetDeviceProperties` - `hipGetLastError` diff --git a/LICENSE.md b/LICENSE.md index cd4c8d43..b1db69f2 100644 --- a/LICENSE.md +++ b/LICENSE.md @@ -6,4 +6,4 @@ Permission is hereby granted, free of charge, to any person obtaining a copy of The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. \ No newline at end of file +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. diff --git a/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md b/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md index c0b9c77f..cb8d31c4 100644 --- a/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md +++ b/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md @@ -3,7 +3,6 @@ Consider the following simple and short demo of using the Address Sanitizer with a HIP application: ```C++ - #include #include @@ -48,8 +47,8 @@ Switching to `--offload-arch=gfx90a:xnack+` in the command above results in a warning-free compilation and an instrumented application. 
After setting `PATH`, `LD_LIBRARY_PATH` and `HSA_XNACK` as described earlier, a check of the binary with `ldd` yields -``` +```shell $ ldd mini linux-vdso.so.1 (0x00007ffd1a5ae000) libclang_rt.asan-x86_64.so => /opt/rocm-5.7.0-99999/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so (0x00007fb9c14b6000) @@ -75,20 +74,16 @@ $ ldd mini This confirms that the address sanitizer runtime is linked in, and the ASAN instrumented version of the runtime libraries are used. Checking the `PATH` yields -``` - +```shell $ which llvm-symbolizer /opt/rocm-5.7.0-99999/llvm/bin/llvm-symbolizer - ``` Lastly, a check of the OS kernel version yields -``` - +```shell $ uname -rv 5.15.0-73-generic #80~20.04.1-Ubuntu SMP Wed May 17 14:58:14 UTC 2023 - ``` which indicates that the required HMM support (kernel version > 5.6) is available. @@ -96,8 +91,7 @@ This completes the necessary setup. Running with `m = 100`, `n1 = 11`, `n2 = 10` and `c = 100` should produce a report for an invalid access by the last 10 threads. -``` - +```gdb ================================================================= ==3141==ERROR: AddressSanitizer: heap-buffer-overflow on amdgpu device 0 at pc 0x7fb1410d2cc4 WRITE of size 4 in workgroup id (10,0,0) @@ -129,13 +123,11 @@ Shadow byte legend (one shadow byte represents 8 application bytes): Heap left redzone: fa ... ==3141==ABORTING - ``` Running with `m = 100`, `n1 = 10`, `n2 = 10` and `c = 99` should produce a report for an invalid copy. -``` - +```gdb ================================================================= ==2817==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x514000150dcc at pc 0x7f5509551aca bp 0x7ffc90a7ae50 sp 0x7ffc90a7a610 WRITE of size 400 at 0x514000150dcc thread T0 @@ -167,5 +159,4 @@ Shadow byte legend (one shadow byte represents 8 application bytes): Heap left redzone: fa ... ==2817==ABORTING - ``` diff --git a/Libraries/hipBLAS/README.md b/Libraries/hipBLAS/README.md index 26c1bb39..330d5863 100644 --- a/Libraries/hipBLAS/README.md +++ b/Libraries/hipBLAS/README.md @@ -1,30 +1,36 @@ # hipBLAS Examples ## Summary + The examples in this subdirectory showcase the functionality of the [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS) library. The examples build on both Linux and Windows for the ROCm (AMD GPU) backend. ## Prerequisites + ### Linux + - [CMake](https://cmake.org/download/) (at least version 3.21) - OR GNU Make - available via the distribution's package manager - [ROCm](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.2/page/Overview_of_ROCm_Installation_Methods.html) (at least version 5.x.x) - [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS): `hipblas` package available from [repo.radeon.com](https://repo.radeon.com/rocm/). - ### Windows + - [Visual Studio](https://visualstudio.microsoft.com/) 2019 or 2022 with the "Desktop Development with C++" workload - ROCm toolchain for Windows (No public release yet) - The Visual Studio ROCm extension needs to be installed to build with the solution files. - [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS) - - Installed as part of the ROCm SDK on Windows for ROCm platform. + - Installed as part of the ROCm SDK on Windows for ROCm platform. - [CMake](https://cmake.org/download/) (optional, to build with CMake. 
Requires at least version 3.21) - [Ninja](https://ninja-build.org/) (optional, to build with CMake) ## Building + ### Linux + Make sure that the dependencies are installed, or use the [provided Dockerfiles](../../Dockerfiles/) to build and run the examples in a containerized environment that has all prerequisites installed. #### Using CMake + All examples in the `hipBLAS` subdirectory can either be built by a single CMake project or be built independently. - `$ cd Libraries/hipBLAS` @@ -32,16 +38,20 @@ All examples in the `hipBLAS` subdirectory can either be built by a single CMake - `$ cmake --build build` #### Using Make + All examples can be built by a single invocation of Make or be built independently. - `$ cd Libraries/hipBLAS` - `$ make` ### Windows + #### Visual Studio + Visual Studio solution files are available for the individual examples. To build all examples for hipBLAS open the top level solution file [ROCm-Examples-VS2019.sln](../../ROCm-Examples-VS2019.sln) and filter for hipBLAS. For more detailed build instructions refer to the top level [README.md](../../README.md#visual-studio). #### CMake + All examples in the `hipBLAS` subdirectory can either be built by a single CMake project or be built independently. For build instructions refer to the top-level [README.md](../../README.md#cmake-2). diff --git a/Libraries/hipBLAS/gemm_strided_batched/README.md b/Libraries/hipBLAS/gemm_strided_batched/README.md index b7bdc8ee..178fe599 100644 --- a/Libraries/hipBLAS/gemm_strided_batched/README.md +++ b/Libraries/hipBLAS/gemm_strided_batched/README.md @@ -1,11 +1,13 @@ # hipBLAS Level 3 Generalized Matrix Multiplication Strided Batched Example ## Description + This example illustrates the use of the hipBLAS Level 3 Strided Batched General Matrix Multiplication. The hipBLAS GEMM STRIDED BATCHED performs a matrix-matrix operation for a _batch_ of matrices as: $C[i] = \alpha \cdot A[i]' \cdot B[i]' + \beta \cdot (C[i])$ for each $i \in [0, batch - 1]$, where $X[i] = X + i \cdot strideX$ is the $i$-th element of the corresponding batch and $X'$ is one of the following: + - $X' = X$ or - $X' = X^T$ (transpose $X$: $X_{ij}^T = X_{ji}$) or - $X' = X^H$ (Hermitian $X$: $X_{ij}^H = \bar X_{ji}$). @@ -14,8 +16,8 @@ In this example the identity is used. $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are the batches of matrices. For each $i$, $A[i]$, $B[i]$ and $C[i]$ are matrices such that $A_i'$ is an $m \times k$ matrix, $B_i'$ a $k \times n$ matrix and $C_i$ an $m \times n$ matrix. - ### Application flow + 1. Read in command-line parameters. 2. Set dimension variables of the matrices and get the batch count. 3. Allocate and initialize the host matrices. Set up the $B$ matrix as an identity matrix. @@ -30,7 +32,9 @@ $A_i'$ is an $m \times k$ matrix, $B_i'$ a $k \times n$ matrix and $C_i$ an $m \ 12. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-a` or `--alpha`. The scalar value $\alpha$ used in the GEMM operation. Its default value is 1. - `-b` or `--beta`. The scalar value $\beta$ used in the GEMM operation. Its default value is 1. - `-c` or `--count`. Batch count. Its default value is 3. @@ -39,49 +43,53 @@ The application provides the following optional command line arguments: - `-k` or `--k`. The number of columns of matrix $A$ and rows of matrix $B$, which must be greater than 0. Its default value is 5.
## Key APIs and Concepts + - The performance of a numerical multi-linear algebra code can be significantly increased by using tensor contractions [ [Y. Shi et al., HiPC, pp 193, 2016.](https://doi.org/10.1109/HiPC.2016.031) ], hence most of the hipBLAS functions have `_batched` and `_strided_batched` [ [C. Jhurani and P. Mullowney, JPDP Vol 75, pp 133, 2015.](https://doi.org/10.1016/j.jpdc.2014.09.003) ] extensions.
We can apply the same multiplication operator to several matrices if we combine them into batched matrices. Batched matrix multiplication improves performance for a large number of small matrices. For a constant stride between matrices, further acceleration is available with strided batched GEMM. - hipBLAS is initialized by calling `hipblasCreate(hipblasHandle_t*)` and it is terminated by calling `hipblasDestroy(hipblasHandle_t)`. - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`HIPBLAS_POINTER_MODE_HOST`) or on the device (`HIPBLAS_POINTER_MODE_DEVICE`). It is controlled by `hipblasSetPointerMode`. - The symbol $X'$ denotes the following operations, as defined in the Description section: - - `HIPBLAS_OP_N`: identity operator ($X' = X$), - - `HIPBLAS_OP_T`: transpose operator ($X' = X^T$) or - - `HIPBLAS_OP_C`: Hermitian (conjugate transpose) operator ($X' = X^H$). + + - `HIPBLAS_OP_N`: identity operator ($X' = X$), + - `HIPBLAS_OP_T`: transpose operator ($X' = X^T$) or + - `HIPBLAS_OP_C`: Hermitian (conjugate transpose) operator ($X' = X^H$). - `hipblasStride`: the stride between matrices or vectors in `_strided_batched` functions. - `hipblas[HSDCZ]gemmStridedBatched`. Depending on the character matched in `[HSDCZ]`, the operation can be performed with different precisions: - - `H`(half-precision: `hipblasHalf`) - - `S` (single-precision: `float`) - - `D` (double-precision: `double`) - - `C` (single-precision complex: `hipblasComplex`) - - `Z` (double-precision complex: `hipblasDoubleComplex`). - - Input parameters for `hipblasSgemmStridedBatched`: - - `hipblasHandle_t handle` - - `hipblasOperation_t trans_a`: transformation operator on each $A_i$ matrix - - `hipblasOperation_t trans_b`: transformation operator on each $B_i$ matrix - - `int m`: number of rows in each $A_i'$ and $C$ matrices - - `int n`: number of columns in each $B_i'$ and $C$ matrices - - `int k`: number of columns in each $A_i'$ matrix and number of rows in each $B_i'$ matrix - - `const float *alpha`: scalar multiplier of each $C_i$ matrix addition - - `const float *A`: pointer to the each $A_i$ matrix - - `int lda`: leading dimension of each $A_i$ matrix - - `long long stride_a`: stride size for each $A_i$ matrix - - `const float *B`: pointer to each $B_i$ matrix - - `int ldb`: leading dimension of each $B_i$ matrix - - `const float *beta`: scalar multiplier of the $B \cdot C$ matrix product - - `long long stride_b`: stride size for each $B_i$ matrix - - `float *C`: pointer to each $C_i$ matrix - - `int ldc`: leading dimension of each $C_i$ matrix - - `long long stride_c`: stride size for each $C_i$ matrix - - `int batch_count`: number of matrices - - Return value: `hipblasStatus_t ` + - `H` (half-precision: `hipblasHalf`) + - `S` (single-precision: `float`) + - `D` (double-precision: `double`) + - `C` (single-precision complex: `hipblasComplex`) + - `Z` (double-precision complex: `hipblasDoubleComplex`).
+ + Input parameters for `hipblasSgemmStridedBatched`: + + - `hipblasHandle_t handle` + - `hipblasOperation_t trans_a`: transformation operator on each $A_i$ matrix + - `hipblasOperation_t trans_b`: transformation operator on each $B_i$ matrix + - `int m`: number of rows in each $A_i'$ and $C$ matrices + - `int n`: number of columns in each $B_i'$ and $C$ matrices + - `int k`: number of columns in each $A_i'$ matrix and number of rows in each $B_i'$ matrix + - `const float *alpha`: scalar multiplier of each $A_i' \cdot B_i'$ matrix product + - `const float *A`: pointer to each $A_i$ matrix + - `int lda`: leading dimension of each $A_i$ matrix + - `long long stride_a`: stride size for each $A_i$ matrix + - `const float *B`: pointer to each $B_i$ matrix + - `int ldb`: leading dimension of each $B_i$ matrix + - `long long stride_b`: stride size for each $B_i$ matrix + - `const float *beta`: scalar multiplier of each $C_i$ matrix + - `float *C`: pointer to each $C_i$ matrix + - `int ldc`: leading dimension of each $C_i$ matrix + - `long long stride_c`: stride size for each $C_i$ matrix + - `int batch_count`: number of matrices + + Return value: `hipblasStatus_t` ## Demonstrated API Calls ### hipBLAS + - `hipblasCreate` - `hipblasDestroy` - `hipblasHandle_t` @@ -93,6 +101,7 @@ We can apply the same multiplication operator for several matrices if we combine - `HIPBLAS_POINTER_MODE_HOST` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/hipBLAS/her/README.md b/Libraries/hipBLAS/her/README.md index 99ef1fff..986c8a17 100644 --- a/Libraries/hipBLAS/her/README.md +++ b/Libraries/hipBLAS/her/README.md @@ -1,6 +1,7 @@ # hipBLAS Level 2 Hermitian Rank-2 Update Example ## Description + This example showcases the usage of the hipBLAS Level 2 Hermitian rank-2 update functionality. The hipBLAS HER2 function performs a Hermitian rank-2 update operation, which is defined as follows: $A = A + \alpha\cdot x\cdot y^H + \bar\alpha \cdot y \cdot x^H$, @@ -8,36 +9,40 @@ $A = A + \alpha\cdot x\cdot y^H + \bar\alpha \cdot y \cdot x^H$, where $A$ is an $n \times n$ Hermitian complex matrix, $x$ and $y$ are complex vectors of $n$ elements, $\alpha$ is a complex scalar and $v^H$ is the _Hermitian transpose_ of a vector $v \in \mathbb{C}^n$. ### Application flow + 1. Read in command-line parameters. 2. Allocate and initialize the host vectors and matrix. 3. Compute CPU reference result. 4. Create a hipBLAS handle. 5. Allocate and initialize the device vectors and matrix. 6. Copy input vectors and matrix from host to device. -6. Invoke the hipBLAS HER2 function. -7. Copy the result from device to host. -8. Destroy the hipBLAS handle and release device memory. -9. Validate the output by comparing it to the CPU reference result. +7. Invoke the hipBLAS HER2 function. +8. Copy the result from device to host. +9. Destroy the hipBLAS handle and release device memory. +10. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-a` or `--alpha`. The scalar value $\alpha$ used in the HER2 operation. Its default value is 1. - `-x` or `--incx`. The stride between consecutive values in the data array that makes up vector $x$, which must be greater than 0. Its default value is 1. - `-y` or `--incy`. The stride between consecutive values in the data array that makes up vector $y$, which must be greater than 0. Its default value is 1. - `-n` or `--n`.
The dimension of matrix $A$ and vectors $x$ and $y$, which must be greater than 0. Its default value is 5. ## Key APIs and Concepts + - hipBLAS is initialized by calling `hipblasCreate(hipblasHandle_t*)` and it is terminated by calling `hipblasDestroy(hipblasHandle_t)`. - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`HIPBLAS_POINTER_MODE_HOST`) or on the device (`HIPBLAS_POINTER_MODE_DEVICE`). It is controlled by `hipblasSetPointerMode`. - `hipblasSetVector` and `hipblasSetMatrix` are helper functions provided by the hipBLAS API for writing data to the GPU, whereas `hipblasGetVector` and `hipblasGetMatrix` are intended for retrieving data from the GPU. Note that `hipMemcpy` can also be used to copy/get data to/from the GPU in the usual way. - `hipblas[CZ]her2(handle, uplo, n, *alpha, *x, incx, *y, incy, *AP, lda)` computes a Hermitian rank-2 update. The character matched in `[CZ]` denotes the data type of the operation, and can be either `C` (complex float: `hipblasComplex`), or `Z` (complex double: `hipblasDoubleComplex`). The required arguments are the following: - - `handle`, the hipBLAS API handle. - - `uplo`. Because a Hermitian matrix is symmetric over the diagonal, except that the values in the upper triangle are the complex conjugate of the values in the lower triangle, the required work can be reduced by only updating a single half of the matrix. The part of the matrix to update is given by `uplo`: `HIPBLAS_FILL_MODE_UPPER` (used in this example) indicates that the upper triangle of $A$ should be updated, `HIPBLAS_FILL_MODE_LOWER` indicates that the lower triangle of $A$ should be updated and `HIPBLAS_FILL_MODE_FULL` indicates that the full matrix will be updated. - - `n` gives the dimensions of the vector and matrix inputs. - - `alpha` is the complex scalar. - - `x` and `y` are the input vectors, and `incx` and `incy` are the increments in elements between items of $x$ and $y$, respectively. - - `AP` is the device pointer to matrix $A$ in device memory. - - `lda` is the _leading dimension_ of $A$, that is, the number of elements between the starts of the columns of $A$. Note that hipBLAS matrices are laid out in _column major_ ordering. + - `handle`, the hipBLAS API handle. + - `uplo`. Because the values in the upper triangle of a Hermitian matrix are the complex conjugates of the corresponding values in the lower triangle, the required work can be reduced by only updating a single half of the matrix. The part of the matrix to update is given by `uplo`: `HIPBLAS_FILL_MODE_UPPER` (used in this example) indicates that the upper triangle of $A$ should be updated, `HIPBLAS_FILL_MODE_LOWER` indicates that the lower triangle of $A$ should be updated and `HIPBLAS_FILL_MODE_FULL` indicates that the full matrix will be updated. + - `n` gives the dimensions of the vector and matrix inputs. + - `alpha` is the complex scalar. + - `x` and `y` are the input vectors, and `incx` and `incy` are the increments in elements between items of $x$ and $y$, respectively. + - `AP` is the device pointer to matrix $A$ in device memory. + - `lda` is the _leading dimension_ of $A$, that is, the number of elements between the starts of the columns of $A$. Note that hipBLAS matrices are laid out in _column major_ ordering. - If `ROCM_MATHLIBS_API_USE_HIP_COMPLEX` is defined (adding `#define ROCM_MATHLIBS_API_USE_HIP_COMPLEX` before `#include `), the hipBLAS API is exposed using the HIP-defined complex types.
That is, `hipblasComplex` is a typedef of `hipFloatComplex` (also named `hipComplex`) and they can be used equivalently. - `hipFloatComplex` and `std::complex` have compatible memory layout, and performing a memory copy between values of these types will correctly perform the expected copy. @@ -46,6 +51,7 @@ The application provides the following optional command line arguments: ## Demonstrated API Calls ### hipBLAS + - `HIPBLAS_FILL_MODE_UPPER` - `HIPBLAS_POINTER_MODE_HOST` - `hipblasCher2` @@ -59,6 +65,7 @@ The application provides the following optional command line arguments: - `hipblasSetVector` ### HIP runtime + - `ROCM_MATHLIBS_API_USE_HIP_COMPLEX` - `hipCaddf` - `hipFloatComplex` diff --git a/Libraries/hipBLAS/scal/README.md b/Libraries/hipBLAS/scal/README.md index 4ed540db..401a643f 100644 --- a/Libraries/hipBLAS/scal/README.md +++ b/Libraries/hipBLAS/scal/README.md @@ -1,9 +1,11 @@ # hipBLAS Level 1 Scal Example ## Description + This example showcases the usage of hipBLAS' Level 1 SCAL function. The Level 1 API defines operations between vectors. SCAL is a scaling operator for an $x$ vector defined as $x_i := \alpha \cdot x_i$. -### Application flow +### Application flow + 1. Read in and parse command line parameters. 2. Allocate and initialize host vector. 3. Compute CPU reference result. @@ -13,26 +15,30 @@ This example showcases the usage of hipBLAS' Level 1 SCAL function. The Level 1 7. Call hipBLAS' SCAL function. 8. Copy the result from device to host. 9. Destroy the hipBLAS handle, release device memory. -10. Validate the output by comparing it to the CPU reference result. +10. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-a` or `--alpha`. The scalar value $\alpha$ used in the SCAL operation. Its default value is 3. - `-x` or `--incx`. The stride between consecutive values in the data array that makes up vector $x$, which must be greater than zero. Its default value is 1. - `-n` or `--n`. The number of elements in vector $x$, which must be greater than zero. Its default value is 5. ## Key APIs and Concepts + - hipBLAS is initialized by calling `hipblasCreate(hipblasHandle_t *handle)` and it is terminated by calling `hipblasDestroy(hipblasHandle_t handle)`. - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`HIPBLAS_POINTER_MODE_HOST`) or on the device (`HIPBLAS_POINTER_MODE_DEVICE`). It is controlled by `hipblasSetPointerMode`. - `hipblas[SDCZ]scal` multiplies each element of the vector by a scalar. Depending on the character matched in `[SDCZ]`, the scaling can be obtained with different precisions: - - `S` (single-precision: `float`) - - `D` (double-precision: `double`) - - `C` (single-precision complex: `hipblasComplex`) - - `Z` (double-precision complex: `hipblasDoubleComplex`). + - `S` (single-precision: `float`) + - `D` (double-precision: `double`) + - `C` (single-precision complex: `hipblasComplex`) + - `Z` (double-precision complex: `hipblasDoubleComplex`).
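As an illustration of the call sequence described above, here is a minimal single-precision sketch. The wrapper function, the header path and the omitted error checking are assumptions made for brevity, not code from the example.

```cpp
#include <hipblas/hipblas.h>

// Scales the n elements of the device array d_x: x[i] := alpha * x[i].
// Minimal sketch: error checking is omitted and d_x is assumed to be an
// already-initialized device pointer.
void scale_on_device(float* d_x, int n, float alpha)
{
    hipblasHandle_t handle;
    hipblasCreate(&handle);

    // alpha is read from host memory in this pointer mode.
    hipblasSetPointerMode(handle, HIPBLAS_POINTER_MODE_HOST);

    // Single-precision SCAL with increment 1 between elements.
    hipblasSscal(handle, n, &alpha, d_x, 1);

    hipblasDestroy(handle);
}
```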
## Demonstrated API Calls ### hipBLAS + - `hipblasCreate` - `hipblasDestroy` - `hipblasHandle_t` @@ -41,6 +47,7 @@ The application provides the following optional command line arguments: - `hipblasSetPointerMode` - `hipblasSscal` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/hipCUB/README.md b/Libraries/hipCUB/README.md index e06285ec..2f01caa2 100644 --- a/Libraries/hipCUB/README.md +++ b/Libraries/hipCUB/README.md @@ -1,34 +1,41 @@ # hipCUB Examples ## Summary + The examples in this subdirectory showcase the functionality of the [hipCUB](https://github.com/ROCmSoftwarePlatform/hipCUB) library. The examples build on both Linux and Windows for both the ROCm (AMD GPU) and CUDA (NVIDIA GPU) backends. ## Prerequisites + ### Linux + - [CMake](https://cmake.org/download/) (at least version 3.21) - OR GNU Make - available via the distribution's package manager - [ROCm](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.1.3/page/Overview_of_ROCm_Installation_Methods.html) (at least version 5.x.x) - [hipCUB](https://github.com/ROCmSoftwarePlatform/hipCUB) - - ROCm platform: `hipCUB-dev` package available from [repo.radeon.com](https://repo.radeon.com/rocm/). The repository is added during the standard ROCm [install procedure](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.1.3/page/How_to_Install_ROCm.html). - - CUDA platform: Install hipCUB from source: [instructions](https://github.com/ROCmSoftwarePlatform/hipCUB#build-and-install). - - [CUB](https://github.com/NVIDIA/cub) is a dependency of hipCUB for NVIDIA platforms. CUB is part of the NVIDIA CUDA Toolkit. + - ROCm platform: `hipCUB-dev` package available from [repo.radeon.com](https://repo.radeon.com/rocm/). The repository is added during the standard ROCm [install procedure](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.1.3/page/How_to_Install_ROCm.html). + - CUDA platform: Install hipCUB from source: [instructions](https://github.com/ROCmSoftwarePlatform/hipCUB#build-and-install). + - [CUB](https://github.com/NVIDIA/cub) is a dependency of hipCUB for NVIDIA platforms. CUB is part of the NVIDIA CUDA Toolkit. ### Windows + - [Visual Studio](https://visualstudio.microsoft.com/) 2019 or 2022 with the "Desktop Development with C++" workload - ROCm toolchain for Windows (No public release yet) - - The Visual Studio ROCm extension needs to be installed to build with the solution files. + - The Visual Studio ROCm extension needs to be installed to build with the solution files. - [hipCUB](https://github.com/ROCmSoftwarePlatform/hipCUB) - - ROCm platform: Installed as part of the ROCm SDK on Windows for ROCm platform. - - CUDA platform: Install hipCUB from source: [instructions](https://github.com/ROCmSoftwarePlatform/hipCUB#build-and-install). - - [CUB](https://github.com/NVIDIA/cub) is a dependency of hipCUB for NVIDIA platforms. CUB is part of the NVIDIA CUDA Toolkit. + - ROCm platform: Installed as part of the ROCm SDK on Windows for ROCm platform. + - CUDA platform: Install hipCUB from source: [instructions](https://github.com/ROCmSoftwarePlatform/hipCUB#build-and-install). + - [CUB](https://github.com/NVIDIA/cub) is a dependency of hipCUB for NVIDIA platforms. CUB is part of the NVIDIA CUDA Toolkit. - [CMake](https://cmake.org/download/) (optional, to build with CMake.
Requires at least version 3.21) - [Ninja](https://ninja-build.org/) (optional, to build with CMake) ## Building + ### Linux + Make sure that the dependencies are installed, or use one of the [provided Dockerfiles](../../Dockerfiles/) to build and run the examples in a containerized environment. #### Using CMake + All examples in the `hipCUB` subdirectory can either be built by a single CMake project or be built independently. - `$ cd Libraries/hipCUB` @@ -36,16 +43,20 @@ All examples in the `hipCUB` subdirectory can either be built by a single CMake - `$ cmake --build build` #### Using Make + All examples can be built by a single invocation of Make or be built independently. - `$ cd Libraries/hipCUB` - `$ make` (on ROCm) or `$ make GPU_RUNTIME=CUDA` (on CUDA) ### Windows + #### Visual Studio + Visual Studio solution files are available for the individual examples. To build all examples for hipCUB open the top level solution file [ROCm-Examples-VS2019.sln](../../ROCm-Examples-VS2019.sln) and filter for hipCUB. For more detailed build instructions refer to the top level [README.md](../../README.md#visual-studio). #### CMake + All examples in the `hipCUB` subdirectory can either be built by a single CMake project or be built independently. For build instructions refer to the top-level [README.md](../../README.md#cmake-2). diff --git a/Libraries/hipCUB/device_radix_sort/README.md b/Libraries/hipCUB/device_radix_sort/README.md index 9106e784..0e7a3609 100644 --- a/Libraries/hipCUB/device_radix_sort/README.md +++ b/Libraries/hipCUB/device_radix_sort/README.md @@ -1,9 +1,11 @@ # hipCUB Device Radix Sort Example ## Description + This simple program showcases the usage of the `hipcub::DeviceRadixSort::SortPairs` function. -### Application flow +### Application flow + 1. Host side data is instantiated in `std::vector` and `std::vector` as key-value pairs. 2. Device side storage is allocated using `hipMalloc`. 3. Data is copied from host to device using `hipMemcpy`. @@ -14,16 +16,19 @@ This simple program showcases the usage of the `hipcub::DeviceRadixSort::SortPai 8. Free all device side memory using `hipFree`. ## Key APIs and Concepts + - The device-level API provided by hipCUB is used in this example. It performs global device level operations (in this case pair sorting using `hipcub::DeviceRadixSort::SortPairs`) on the GPU. ## Demonstrated API Calls + ### hipCUB + - `hipcub::DoubleBuffer` - `hipcub::DeviceRadixSort::SortPairs` ### HIP runtime + - `hipGetErrorString` - `hipMalloc` - `hipMemcpy` - `hipFree` - diff --git a/Libraries/hipCUB/device_sum/README.md b/Libraries/hipCUB/device_sum/README.md index efc3acf7..9a686d43 100644 --- a/Libraries/hipCUB/device_sum/README.md +++ b/Libraries/hipCUB/device_sum/README.md @@ -1,9 +1,11 @@ # hipCUB Device Sum Example ## Description + This simple program showcases the usage of `hipcub::DeviceReduce::Sum()`. -### Application flow +### Application flow + 1. Host side data is instantiated in a `std::vector`. 2. Device side storage is allocated using `hipMalloc`. 3. Data is copied from host to device using `hipMemcpy`. @@ -14,13 +16,17 @@ This simple program showcases the usage of the `hipcub::DeviceReduce::Sum()`. 8. Free any device side memory using `hipFree`. ## Key APIs and Concepts + - The device-level API provided by hipCUB is used in this example. It performs global device level operations (in this case a sum reduction using `hipcub::DeviceReduce::Sum`) on the GPU.
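The two-phase calling convention of the device-level API can be sketched as follows. The wrapper function and the omitted error checking are assumptions; the double call of `hipcub::DeviceReduce::Sum` (first to query the temporary storage size, then to reduce) is the pattern the example relies on.

```cpp
#include <hipcub/hipcub.hpp>

// Sums num_items integers from d_in into d_out (both device pointers).
// Minimal sketch: error checking is omitted.
void device_sum(const int* d_in, int* d_out, int num_items)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: d_temp_storage is null, so only the required temporary
    // storage size is written to temp_storage_bytes.
    hipcub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    hipMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: performs the reduction on the GPU.
    hipcub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    hipFree(d_temp_storage);
}
```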
## Demonstrated API Calls + ### hipCUB + - `hipcub::DeviceReduce::Sum` ### HIP runtime + - `hipGetErrorString` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/hipSOLVER/README.md b/Libraries/hipSOLVER/README.md index 3d02d2b0..7ffc7fb1 100644 --- a/Libraries/hipSOLVER/README.md +++ b/Libraries/hipSOLVER/README.md @@ -1,30 +1,36 @@ # hipSOLVER Examples ## Summary + The examples in this subdirectory showcase the functionality of the [hipSOLVER](https://github.com/ROCmSoftwarePlatform/hipSOLVER) library. The examples build on both Linux and Windows for the ROCm (AMD GPU) backend. ## Prerequisites + ### Linux + - [CMake](https://cmake.org/download/) (at least version 3.21) - OR GNU Make - available via the distribution's package manager - [ROCm](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.2/page/Overview_of_ROCm_Installation_Methods.html) (at least version 5.x.x) - [hipSOLVER](https://github.com/ROCmSoftwarePlatform/hipSOLVER): `hipsolver` package available from [repo.radeon.com](https://repo.radeon.com/rocm/). The repository is added during the standard ROCm [install procedure](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.2/page/How_to_Install_ROCm.html). - ### Windows + - [Visual Studio](https://visualstudio.microsoft.com/) 2019 or 2022 with the "Desktop Development with C++" workload - ROCm toolchain for Windows (No public release yet) - The Visual Studio ROCm extension needs to be installed to build with the solution files. - [hipSOLVER](https://github.com/ROCmSoftwarePlatform/hipSOLVER) - - Installed as part of the ROCm SDK on Windows for ROCm platform. + - Installed as part of the ROCm SDK on Windows for ROCm platform. - [CMake](https://cmake.org/download/) (optional, to build with CMake. Requires at least version 3.21) - [Ninja](https://ninja-build.org/) (optional, to build with CMake) ## Building + ### Linux + Make sure that the dependencies are installed, or use the [provided Dockerfiles](../../Dockerfiles/) to build and run the examples in a containerized environment that has all prerequisites installed. #### Using CMake + All examples in the `hipSOLVER` subdirectory can either be built by a single CMake project or be built independently. - `$ cd Libraries/hipSOLVER` @@ -32,16 +38,20 @@ All examples in the `hipSOLVER` subdirectory can either be built by a single CMa - `$ cmake --build build` #### Using Make + All examples can be built by a single invocation to Make or be built independently. - `$ cd Libraries/hipSOLVER` - `$ make` ### Windows + #### Visual Studio + Visual Studio solution files are available for the individual examples. To build all examples for hipSOLVER open the top level solution file [ROCm-Examples-VS2017.sln](../../ROCm-Examples-VS2017.sln), [ROCm-Examples-VS2019.sln](../../ROCm-Examples-VS2019.sln) or [ROCm-Examples-VS2022.sln](../../ROCm-Examples-VS2022.sln) (for Visual Studio 2017, 2019 or 2022, respectively) and filter for hipSOLVER. For more detailed build instructions refer to the top level [README.md](../../README.md#visual-studio). #### CMake + All examples in the `hipSOLVER` subdirectory can either be built by a single CMake project or be built independently. For build instructions refer to the top-level [README.md](../../README.md#cmake-2). 
diff --git a/Libraries/hipSOLVER/gels/README.md b/Libraries/hipSOLVER/gels/README.md index 3da96fe1..685b8caa 100644 --- a/Libraries/hipSOLVER/gels/README.md +++ b/Libraries/hipSOLVER/gels/README.md @@ -1,6 +1,7 @@ # hipSOLVER linear least-squares ## Description + This example illustrates the use of hipSOLVER's linear least-squares solver, `gels`. The `gels` functions solve an overdetermined (or underdetermined) linear system defined by an $m$-by-$n$ matrix $A$, and a corresponding matrix $B$, using the QR factorization computed by `geqrf` (or the LQ factorization computed by `gelqf`). The problem solved by this function is of the form $A\times X=B$. If $m\geq n$, the system is overdetermined and a least-squares solution approximating $X$ is found by minimizing $||B−A\times X||$ (or $||B−A^\prime\times X||$). If $m \text{ \textless}\ n$, the system is underdetermined and a unique solution for X is chosen such that $||X||$ is minimal. @@ -8,6 +9,7 @@ If $m\geq n$, the system is overdetermined and a least-squares solution approxim This example shows how $A\times X = B$ is solved for $X$, where $X$ is an $m$-by-$1$ matrix. The result is validated by calculating $A\times X$ for the found result, and comparing that with $B$. ### Application flow + 1. Parse the user inputs, declare several constants for the sizes of the matrices. 2. Allocate the input- and output matrices on the host and device, initialize the input data. 3. Create a hipSOLVER handle. @@ -19,23 +21,33 @@ This example shows how $A\times X = B$ is solved for $X$, where $X$ is an $m$-by 9. Validate that the result found is correct by calculating $A\times X$, and print the result. ### Command line interface + The application provides the following optional command line arguments: + - `--n `. Number of rows of input matrix $A$, the default value is `3`. - `--m `. Number of columns of input matrix $A$, the default value is `2`. ## Key APIs and Concepts + ### hipSOLVER + - hipSOLVER is initialized by calling `hipsolverCreate(hipsolverHandle_t*)` and it is terminated by calling `hipsolverDestroy(hipsolverHandle_t)`. + - `hipsolver(SS|DD|CC|ZZ)gels` solves the system of linear equations defined by $A\times X=B$, where $A$ is an `m`-by-`n` matrix, $X$ is an `n`-by-`nrhs` matrix, and $B$ is an `m`-by-`nrhs` matrix. Depending on the character matched in `(SS|DD|CC|ZZ)`, the solution can be obtained with different precisions: - - `S` (single-precision: `float`). - - `D` (double-precision: `double`). - - `C` (single-precision complex: `hipFloatComplex`). - - `Z` (double-precision complex: `hipDoubleComplex`). - + + - `S` (single-precision: `float`). + - `D` (double-precision: `double`). + - `C` (single-precision complex: `hipFloatComplex`). + - `Z` (double-precision complex: `hipDoubleComplex`). + The `gels` function also requires the specification of the _leading dimension_ of all matrices. The leading dimension specifies the number of elements between the beginnings of successive matrix vectors. In other fields, this may be referred to as the _stride_. This concept allows the matrix used in the `gels` function to be a sub-matrix of a larger one. Since hipSOLVER matrices are stored in column-major order, the leading dimension must be greater than or equal to the number of rows of the matrix. + - `hipsolver(SS|DD|CC|ZZ)gels_bufferSize` allows to obtain the size needed for the working space for the `hipsolver(SS|DD|CC|ZZ)gels` function. 
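Putting the pieces above together, a minimal sketch of the `gels` call sequence for the double-precision case could look as follows. This is not the example's actual source: the function `solve_least_squares`, the buffer names and the leading dimensions are illustrative assumptions, and every returned `hipsolverStatus_t` and `hipError_t` should be checked in real code:

```cpp
#include <hipsolver/hipsolver.h> // <hipsolver.h> on older ROCm releases
#include <hip/hip_runtime.h>

// Sketch only: d_A (m-by-n), d_B (m-by-nrhs) and d_X (n-by-nrhs) are assumed
// to be device buffers that were allocated and initialized elsewhere.
void solve_least_squares(double* d_A, double* d_B, double* d_X,
                         int m, int n, int nrhs)
{
    hipsolverHandle_t handle;
    hipsolverCreate(&handle);

    // Query the size of the working space for the double-precision variant.
    size_t lwork{};
    hipsolverDDgels_bufferSize(handle, m, n, nrhs, d_A, m, d_B, m, d_X, n, &lwork);

    void* d_work{};
    int*  d_info{};
    hipMalloc(&d_work, lwork);
    hipMalloc(&d_info, sizeof(int));

    // Solve A * X = B in the least-squares sense; niters reports the number
    // of refinement iterations performed, d_info is 0 on success.
    int niters{};
    hipsolverDDgels(handle, m, n, nrhs, d_A, m, d_B, m, d_X, n,
                    d_work, lwork, &niters, d_info);

    hipFree(d_info);
    hipFree(d_work);
    hipsolverDestroy(handle);
}
```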
+ ## Used API surface + ### hipSOLVER + - `hipsolverDDgels` - `hipsolverDDgels_bufferSize` - `hipsolverHandle_t` @@ -43,6 +55,7 @@ The application provides the following optional command line arguments: - `hipsolverDestroy` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/hipSOLVER/geqrf/README.md b/Libraries/hipSOLVER/geqrf/README.md index ea6a4006..5bd542b9 100644 --- a/Libraries/hipSOLVER/geqrf/README.md +++ b/Libraries/hipSOLVER/geqrf/README.md @@ -1,7 +1,9 @@ # hipSOLVER QR Factorization Example ## Description + This example illustrates the use of hipSOLVER to compute the QR factorization of a matrix $A$. The [QR factorization](https://en.wikipedia.org/wiki/QR_decomposition) of a $m \times n$ matrix $A$ computes the unitary matrix $Q$ and upper triangular matrix $R$, such that $A = QR$. The QR factorization is calculated using householder transformations. + - $Q$ is an $m \times m$ unitary matrix, i.e. $Q^{-1} = Q^H$ - $R$ is an $m \times n$ upper (or right) triangular matrix, i.e. all entries below the diagonal are zero. @@ -12,6 +14,7 @@ In the general case hipSOLVER calculates $Q_1$ and $R_1$. The calculated solution is verified by computing the root mean square of the elements in $Q^H Q - I$, which should result in a zero matrix, to check whether $Q$ is actually orthogonal. ### Application flow + 1. Declare and initialize variables for the in- and output matrix. 2. Initialize the matrix on the host. 3. Allocate device memory and copy the matrix to the device. @@ -22,78 +25,101 @@ The calculated solution is verified by computing the root mean square of the ele 8. Free device memory and handles. ## Key APIs and Concepts + ### hipSOLVER + - hipSOLVER is initialized by calling `hipsolverCreate(hipsolverHandle_t*)` and it is terminated by calling `hipsolverDestroy(hipsolverHandle_t)`. + - `hipsolver[SDCZ]geqrf` computes the QR factorization of a $m \times n$ matrix $A$. The results of $Q$ and $R$ are stored in place of $A$. The orthogonal matrix $Q$ is not explicitly calculated, it is stored using householder vectors, which can be used to explicitly calculate $Q$ with `hipsolver[SDCZ]orgqr`. Depending on the character matched in `[SDCZ]`, the QR factorization can be obtained with different precisions: - - `S` (single-precision: `float`) - - `D` (double-precision: `double`) - - `C` (single-precision complex: `hipFloatComplex`) - - `Z` (double-precision complex: `hipDoubleComplex`). - - In this example the double-precision variant `hipsolverDgeqrf` is used. - Its input parameters are: - - `hipsolverHandle_t handle` - - `int m` number of rows of $A$ - - `int n` number of columns of $A$ - - `double *A` pointer to matrix $A$ - - `int lda` leading dimension of matrix $A$ - - `double *tau` vector that stores the scaling factors for the householder vectors. - - `double *work` memory for working space used by the function - - `int lwork` size of working space - - `int *devInfo` status report of the function. The QR factorization is successful if the value pointed to by devInfo is 0. When using cuSOLVER as backend, if the value is a negative integer $-i$, then the i-th parameter of `hipsolverDgeqrf` is wrong. - The return type is `hipsolverStatus_t`. + + - `S` (single-precision: `float`) + - `D` (double-precision: `double`) + - `C` (single-precision complex: `hipFloatComplex`) + - `Z` (double-precision complex: `hipDoubleComplex`). + + In this example the double-precision variant `hipsolverDgeqrf` is used. 
+ + Its input parameters are: + + - `hipsolverHandle_t handle` + - `int m` number of rows of $A$ + - `int n` number of columns of $A$ + - `double *A` pointer to matrix $A$ + - `int lda` leading dimension of matrix $A$ + - `double *tau` vector that stores the scaling factors for the householder vectors. + - `double *work` memory for working space used by the function + - `int lwork` size of working space + - `int *devInfo` status report of the function. The QR factorization is successful if the value pointed to by devInfo is 0. When using cuSOLVER as backend, if the value is a negative integer $-i$, then the i-th parameter of `hipsolverDgeqrf` is wrong. + + The return type is `hipsolverStatus_t`. - `hipsolver[SDCZ]geqrf_bufferSize` calculates the required size of the working space for `hipsolver[SDCZ]geqrf`. The used type has to match the actual solver function. - The input parameters for `hipsolverDgeqrf_bufferSize` are: - - `hipsolverHandle_t handle` - - `int m` number of rows of $A$ - - `int n` number of columns of $A$ - - `double *A` pointer to matrix $A$ - - `int lda` leading dimension of matrix $A$ - - `int *lwork` returns the size of the working space required - The return type is `hipsolverStatus_t`. + + The input parameters for `hipsolverDgeqrf_bufferSize` are: + + - `hipsolverHandle_t handle` + - `int m` number of rows of $A$ + - `int n` number of columns of $A$ + - `double *A` pointer to matrix $A$ + - `int lda` leading dimension of matrix $A$ + - `int *lwork` returns the size of the working space required + + The return type is `hipsolverStatus_t`. - `hipsolver[SD]orgqr` computes the orthogonal matrix $Q$ from the householder vectors, as stored in $A$, and the corresponding scaling factors as stored in tau, both as returned by `hipsolver[SD]geqrf`. - In the case of complex matrices, the function `hipsolver[CZ]ungqr` has to be used. - In this example the double-precision variant `hipsolverDorgqr` is used. - Its input parameters are: - - `hipsolverHandle_t handle` - - `int m` number of rows of matrix $Q$ - - `int n` number of columns of matrix $Q$ ($m \geq n \gt 0$) - - `int k` number of elementary reflections whose product defines the matrix $Q$ ($n \geq k \geq 0$) - - `double *A` matrix containing the householder vectors - - `int lda` leading dimension of $A$ - - `double *tau` vector that stores the scaling factors for the householder vectors - - `double *work` memory for working space used by the function - - `int lwork` size of working space - - `int *devInfo` status report of the function. The computation of $Q$ is successful if the value pointed to by devInfo is 0. When using cuSOLVER as backend, if the value is a negative integer $-i$, then the i-th parameter of `hipsolverDorgqr` is wrong. - The return type is `hipsolverStatus_t`. + + In the case of complex matrices, the function `hipsolver[CZ]ungqr` has to be used. + In this example the double-precision variant `hipsolverDorgqr` is used. 
+ + Its input parameters are: + + - `hipsolverHandle_t handle` + - `int m` number of rows of matrix $Q$ + - `int n` number of columns of matrix $Q$ ($m \geq n \gt 0$) + - `int k` number of elementary reflections whose product defines the matrix $Q$ ($n \geq k \geq 0$) + - `double *A` matrix containing the householder vectors + - `int lda` leading dimension of $A$ + - `double *tau` vector that stores the scaling factors for the householder vectors + - `double *work` memory for working space used by the function + - `int lwork` size of working space + - `int *devInfo` status report of the function. The computation of $Q$ is successful if the value pointed to by devInfo is 0. When using cuSOLVER as backend, if the value is a negative integer $-i$, then the i-th parameter of `hipsolverDorgqr` is wrong. + + The return type is `hipsolverStatus_t`. - `hipsolver[SD]orgqr_bufferSize` calculates the required size of the working space for `hipsolver[SD]orgqr`. The used type has to match the actual solver function. - The input parameters for `hipsolverDorgqr_bufferSize` are: - - `hipsolverHandle_t handle` - - `int m` number of rows of matrix $Q$ - - `int n` number of columns of matrix $Q$ - - `int k` number of elementary reflection - - `double *A` matrix containing the householder vectors - - `int lda` leading dimension of $A$ - - `double *tau` vector that stores the scaling factors for the householder vectors - - `int *lwork` returns the size of the working space required - The return type is `hipsolverStatus_t`. + + The input parameters for `hipsolverDorgqr_bufferSize` are: + + - `hipsolverHandle_t handle` + - `int m` number of rows of matrix $Q$ + - `int n` number of columns of matrix $Q$ + - `int k` number of elementary reflection + - `double *A` matrix containing the householder vectors + - `int lda` leading dimension of $A$ + - `double *tau` vector that stores the scaling factors for the householder vectors + - `int *lwork` returns the size of the working space required + + The return type is `hipsolverStatus_t`. ### hipBLAS + hipBLAS is used to validate the solution. To verify that $Q$ is orthogonal the solution $Q^T Q - I$ is computed using `hipblasDgemm` and the root mean square of the elements of that result is calculated using `hipblasDnrm2`. `hipblasDgemm` is showcased in the [gemm_strided_batched example](/Libraries/hipBLAS/gemm_strided_batched/). `hipblasDnrm2` calculates the euclidean norm of a vector. In this example the root mean square of the elements in a matrix is calculated by pretending it to be a vector and calculating its euclidean norm, then dividing it by the number of elements in the matrix. - Its input parameters are: - - `hipblasHandle_t handle` - - `int n` number of elements in x - - `double *x` device pointer storing vector x - - `int incx` stride between consecutive elements of x - - `double *result` resulting norm + +Its input parameters are: + +- `hipblasHandle_t handle` +- `int n` number of elements in x +- `double *x` device pointer storing vector x +- `int incx` stride between consecutive elements of x +- `double *result` resulting norm - The `hipblasPointerMode_t` type controls whether scalar parameters must be allocated on the host (`HIPBLAS_POINTER_MODE_HOST`) or on the device (`HIPBLAS_POINTER_MODE_DEVICE`). It is set by using `hipblasSetPointerMode`. + ## Used API surface + ### hipSOLVER + - `hipsolverCreate` - `hipsolverDestroy` - `hipsolverDgeqrf` @@ -103,6 +129,7 @@ hipBLAS is used to validate the solution. 
To verify that $Q$ is orthogonal the s - `hipsolverHandle_t` ### hipBLAS + - `hipblasCreate` - `hipblasDestroy` - `hipblasDgemm` @@ -114,6 +141,7 @@ hipBLAS is used to validate the solution. To verify that $Q$ is orthogonal the s - `hipblasSetPointerMode` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/hipSOLVER/gesvd/README.md b/Libraries/hipSOLVER/gesvd/README.md index e9afce50..c7b8e431 100644 --- a/Libraries/hipSOLVER/gesvd/README.md +++ b/Libraries/hipSOLVER/gesvd/README.md @@ -1,12 +1,17 @@ # hipSOLVER Singular Value Decomposition Example ## Description + This example illustrates the use of the hipSOLVER Singular Value Decomposition functionality. The hipSOLVER `gesvd` computes the singular values and optionally the left and right singular vectors of an $m \times n$ matrix $A$. The [singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) (SVD) is then given by $A = U \cdot S \cdot V^H$, where: + - $U$ is an $m \times m$ orthonormal matrix. Its column vectors are known as _left singular vectors_ of $A$ and correspond to the eigenvectors of the Hermitian and positive semi-definite matrix $AA^H$. + - $S$ is an $m \times n$ diagonal matrix with non-negative real numbers on the diagonal, the _singular values_ of $A$, defined as the (positive) square roots of the eigenvalues of the Hermitian and positive semi-definite matrix $A^HA$. Note that we always have $rank(A)$ non-zero singular values. + - $V^H$ is the Hermitian transpose of an $n \times n$ orthonormal matrix, $V$. Its row vectors are known as the _right singular vectors_ of $A$ and are defined as the eigenvectors of the Hermitian and positive semi-definite matrix $A^HA$. ### Application flow + 1. Parse command line arguments for the dimension of the input matrix. 2. Declare and initialize a number of constants for the input and output matrices and vectors. 3. Allocate and initialize the host matrices and vectors. @@ -20,70 +25,88 @@ This example illustrates the use of the hipSOLVER Singular Value Decomposition f 11. Free device memory and the handles. ### Command line interface + The application provides the following optional command line arguments: + - `--n `. Number of rows of input matrix $A$, the default value is `3`. - `--m `. Number of columns of input matrix $A$, the default value is `2`. ## Key APIs and Concepts + ### hipSOLVER + - hipSOLVER is initialized by calling `hipsolverCreate(hipsolverHandle_t*)` and it is terminated by calling `hipsolverDestroy(hipsolverHandle_t)`. + - `hipsolver[SDCZ]gesvd` computes the singular values and optionally the left and right singular vectors of an $m \times n$ matrix $A$. The correct function signature should be chosen based on the datatype of the input matrix: - - `S` (single-precision real: `float`) - - `D` (double-precision real: `double`) - - `C` (single-precision complex: `hipFloatComplex`) - - `Z` (double-precision complex: `hipDoubleComplex`). - - In this example, a double-precision real input matrix is used, in which case the function accepts the following parameters: - - `hipsolverHandle_t handle` - - `signed char jobu` and `signed char jobv`: define how the left and right singular vectors, respectively, are calculated and stored. The following values are accepted: - - `'A'`: all columns of $U$, or rows of $V^H$, are calculated. - - `'S'`: only the singular vectors associated to the singular values of $A$ are calculated and stored as columns of $U$ or rows of $V^H$. 
In this case some columns of $U$ or rows of $V^H$ may be left unmodified. - - `'O'`: same as `'S'`, but the singular vectors are stored in matrix $A$, overwriting it. - - `'N'`: singular vectors are not computed. - - `int m`: number of rows of $A$ - - `int n`: number of columns of $A$ - - `double *A`: pointer to matrix $A$ - - `int lda`: leading dimension of matrix $A$ - - `double *S`: pointer to vector $S$ - - `double *U`: pointer to matrix $U$ - - `int ldu`: leading dimension of matrix $U$ - - `double *V`: pointer to matrix $V^H$ - - `int ldv`: leading dimension of matrix $V^H$ - - `double *work`: pointer to working space. - - `int lwork`: size of the working space. - - `double *rwork`: unconverged superdiagonal elements of the upper bidiagonal matrix used internally for the BDSQR algorithm. - - `int *devInfo`: convergence result of the BDSQR function. If 0, the algorithm converged, if greater than 0 then `info` elements of vector $E$ did not converge to 0. - - Return type: `hipsolverStatus_t`. + + - `S` (single-precision real: `float`) + - `D` (double-precision real: `double`) + - `C` (single-precision complex: `hipFloatComplex`) + - `Z` (double-precision complex: `hipDoubleComplex`). + + In this example, a double-precision real input matrix is used, in which case the function accepts the following parameters: + + - `hipsolverHandle_t handle` + - `signed char jobu` and `signed char jobv`: define how the left and right singular vectors, respectively, are calculated and stored. The following values are accepted: + + - `'A'`: all columns of $U$, or rows of $V^H$, are calculated. + - `'S'`: only the singular vectors associated to the singular values of $A$ are calculated and stored as columns of $U$ or rows of $V^H$. In this case some columns of $U$ or rows of $V^H$ may be left unmodified. + - `'O'`: same as `'S'`, but the singular vectors are stored in matrix $A$, overwriting it. + - `'N'`: singular vectors are not computed. + + - `int m`: number of rows of $A$ + - `int n`: number of columns of $A$ + - `double *A`: pointer to matrix $A$ + - `int lda`: leading dimension of matrix $A$ + - `double *S`: pointer to vector $S$ + - `double *U`: pointer to matrix $U$ + - `int ldu`: leading dimension of matrix $U$ + - `double *V`: pointer to matrix $V^H$ + - `int ldv`: leading dimension of matrix $V^H$ + - `double *work`: pointer to working space. + - `int lwork`: size of the working space. + - `double *rwork`: unconverged superdiagonal elements of the upper bidiagonal matrix used internally for the BDSQR algorithm. + - `int *devInfo`: convergence result of the BDSQR function. If 0, the algorithm converged, if greater than 0 then `info` elements of vector $E$ did not converge to 0. + + Return type: `hipsolverStatus_t`. + - `hipsolver[SDCZ]gesvd_bufferSize` allows to obtain the size (in bytes) needed for the working space for the `hipsolver[SDCZ]gesvd` function. The character matched in `[SDCZ]` coincides with the one in `hipsolver[SDCZ]gesvd`. - This function accepts the following input parameters: - - `hipsolverHandle_t handle` - - `signed char jobu`: defines how left singular vectors are calculated and stored - - `signed char jobv`: defines how right singular vectors are calculated and stored - - `int m`: number of rows of $A$ - - `int n`: number of columns of $A$ - - `int lwork`: size (to be computed) of the working space. 
+ This function accepts the following input parameters: + + - `hipsolverHandle_t handle` + - `signed char jobu`: defines how left singular vectors are calculated and stored + - `signed char jobv`: defines how right singular vectors are calculated and stored + - `int m`: number of rows of $A$ + - `int n`: number of columns of $A$ + - `int lwork`: size (to be computed) of the working space. - Return type: `hipsolverStatus_t`. + Return type: `hipsolverStatus_t`. ### hipBLAS + - For validating the solution we have used the hipBLAS functions `hipblasDdgmm` and `hipblasDgemm`. `hipblasDgemm` is showcased (strided-batched and with single-precision) in the [gemm_strided_batched example](/Libraries/hipBLAS/gemm_strided_batched/). - - The `hipblas[SDCZ]dgmm` function performs a matrix--matrix operation between a diagonal matrix and a general $m \times n$ matrix. The order of the multiplication can be determined using a `hipblasSideMode_t` type parameter: - - `HIPBLAS_SIDE_RIGHT`: the operation performed is $C = A \cdot diag(x)$. - - `HIPBLAS_SIDE_LEFT`: the operation performed is $C = diag(x) \cdot A$. This is the one used in the example for computing $S \cdot V^H$. - The correct function signature should be chosen based on the datatype of the input matrices: - - `S` (single-precision real: `float`) - - `D` (double-precision real: `double`) - - `C` (single-precision complex: `hipblasComplex`) - - `Z` (double-precision complex: `hipblasDoubleComplex`). + - The `hipblas[SDCZ]dgmm` function performs a matrix--matrix operation between a diagonal matrix and a general $m \times n$ matrix. The order of the multiplication can be determined using a `hipblasSideMode_t` type parameter: + + - `HIPBLAS_SIDE_RIGHT`: the operation performed is $C = A \cdot diag(x)$. + - `HIPBLAS_SIDE_LEFT`: the operation performed is $C = diag(x) \cdot A$. This is the one used in the example for computing $S \cdot V^H$. + + The correct function signature should be chosen based on the datatype of the input matrices: + + - `S` (single-precision real: `float`) + - `D` (double-precision real: `double`) + - `C` (single-precision complex: `hipblasComplex`) + - `Z` (double-precision complex: `hipblasDoubleComplex`). + + Return type: `hipblasStatus_t`. - Return type: `hipblasStatus_t`. - The `hipblasPointerMode_t` type controls whether scalar parameters must be allocated on the host (`HIPBLAS_POINTER_MODE_HOST`) or on the device (`HIPBLAS_POINTER_MODE_DEVICE`). It is set by using `hipblasSetPointerMode`. ## Used API surface + ### hipSOLVER + - `hipsolverDgesvd` - `hipsolverDgesvd_bufferSize` - `hipsolverHandle_t` @@ -91,6 +114,7 @@ The application provides the following optional command line arguments: - `hipsolverDestroy` ### hipBLAS + - `hipblasCreate` - `hipblasDestroy` - `hipblasDdgmm` @@ -102,6 +126,7 @@ The application provides the following optional command line arguments: - `HIPBLAS_SIDE_LEFT` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/hipSOLVER/getrf/README.md b/Libraries/hipSOLVER/getrf/README.md index 5194ab7f..6ba7a68d 100644 --- a/Libraries/hipSOLVER/getrf/README.md +++ b/Libraries/hipSOLVER/getrf/README.md @@ -1,18 +1,27 @@ # hipSOLVER LU Factorization Example ## Description + This example illustrates the use of the hipSOLVER LU factorization functionality. The hipSOLVER `getrf` computes the [LU decomposition](https://en.wikipedia.org/wiki/LU_decomposition) of an $m \times n$ matrix $A$, with partial pivoting. 
This factorization is given by $P \cdot A = L \cdot U$, where:
+
- `getrf()`: This is the blocked Level-3-BLAS version of the LU factorization algorithm. For mid-size matrices, an optimized internal implementation that makes no rocBLAS calls may be executed.
+
- $A$ is the $m \times n$ input matrix.
+
- $P$ is an $m \times m$ [permutation matrix](https://en.wikipedia.org/wiki/Permutation_matrix), in this example stored as an array of row indices `vector<int> Ipiv` of size `min(m, n)`.
+
- $L$ is:
+
  - an $m \times m$ lower triangular matrix, when $m \leq n$.
  - an $m \times n$ lower trapezoidal matrix, when $m > n$.
+
- $U$ is:
+
  - an $m \times n$ upper trapezoidal matrix, when $m < n$.
  - an $n \times n$ upper triangular matrix, when $m \geq n$

### Application flow
+
1. Parse command line arguments for the dimension of the input matrix.
2. Declare and initialize a number of constants for the input and output matrices and vectors.
3. Allocate and initialize the host matrices and vectors.
@@ -25,15 +34,18 @@ This example illustrates the use of the hipSOLVER LU factorization functionality
10. Free device memory and the hipSOLVER handle.

## Key APIs and Concepts
+
### hipSOLVER
+
- `hipsolver[SDCZ]getrf` computes the LU factorization of an $m \times n$ input matrix $A$. The correct function signature should be chosen based on the datatype of the input matrix:
+
  - `S` (single-precision: `float`)
  - `D` (double-precision: `double`)
  - `C` (single-precision complex: `hipFloatComplex`)
  - `Z` (double-precision complex: `hipDoubleComplex`).

  Input parameters for the precision used in this example (double-precision):
+
  - `hipsolverHandle_t handle`
  - `const int m`: number of rows of $A$
  - `const int n`: number of columns of $A$
@@ -47,16 +59,20 @@ This example illustrates the use of the hipSOLVER LU factorization functionality
- `hipsolver[SDCZ]getrf_bufferSize` allows obtaining the size (in bytes) needed for the working space for the `hipsolver[SDCZ]getrf` function. The character matched in `[SDCZ]` coincides with the one in `hipsolver[SDCZ]getrf`.

  This function accepts the following input parameters:
+
  - `hipsolverHandle_t handle`
  - `int m` number of rows of $A$
  - `int n` number of columns of $A$
  - `double *A` pointer to matrix $A$
  - `int lda` leading dimension of matrix $A$
  - `int *lwork` returns the size of the working space required
- The return type is `hipsolverStatus_t`.
+
+ The return type is `hipsolverStatus_t`.

## Used API surface
+
### hipSOLVER
+
- `hipsolverHandle_t`
- `hipsolverCreate`
- `hipsolverDestroy`
@@ -64,6 +80,7 @@ This example illustrates the use of the hipSOLVER LU factorization functionality
- `hipsolverDgetrf`

### HIP runtime
+
- `hipFree`
- `hipMalloc`
- `hipMemcpy`
diff --git a/Libraries/hipSOLVER/potrf/README.md b/Libraries/hipSOLVER/potrf/README.md
index f473b24b..50c09a19 100644
--- a/Libraries/hipSOLVER/potrf/README.md
+++ b/Libraries/hipSOLVER/potrf/README.md
@@ -1,12 +1,13 @@
# hipSOLVER Cholesky Decomposition and linear system solver

## Description
+
This example illustrates the functionality to perform Cholesky decomposition, `potrf`, and to solve a linear system using the resulting Cholesky factor, `potrs`. The `potrf` functions decompose a Hermitian positive-definite matrix $A$ into $L\cdot L^H$ (or $U^H\cdot U$), where $L$ and $U$ are a lower- and upper-triangular matrix, respectively. The `potrs` functions solve a linear system $A\times X=B$ for $X$.

### Application flow
+
1. Declare several constants for the sizes of the matrices.
-2.
Allocate the input- and output-matrices on the host and device, initialize the input data. Matrix $A_0$ is not - Hermitian positive semi-definite, matrix $A_1$ is Hermitian positive semi-definite. +2. Allocate the input- and output-matrices on the host and device, initialize the input data. Matrix $A_0$ is not Hermitian positive semi-definite, matrix $A_1$ is Hermitian positive semi-definite. 3. Create a hipSOLVER handle. 4. Query the size of the working space of the `potrf` and `potrs` functions and allocate the required amount of device memory. 5. Call the `potrf` function to decompose $A_0$ and assert that it failed since $A_0$ does not meet the requirements. @@ -17,20 +18,30 @@ This example illustrates the functionality to perform Cholesky decomposition, `p 10. Validate that the result found is correct by calculating $A_1\times X$, and print the result. ## Key APIs and Concepts + ### hipSOLVER + - hipSOLVER is initialized by calling `hipsolverCreate(hipsolverHandle_t*)` and it is terminated by calling `hipsolverDestroy(hipsolverHandle_t)`. + - `hipsolver[SDCZ]potrf` performs Cholesky decomposition on Hermitian positive semi-definite matrix $A$. The correct function signature should be chosen based on the datatype of the input matrix: - - `S` (single-precision: `float`). - - `D` (double-precision: `double`). - - `C` (single-precision complex: `hipFloatComplex`). - - `Z` (double-precision complex: `hipDoubleComplex`). + + - `S` (single-precision: `float`). + - `D` (double-precision: `double`). + - `C` (single-precision complex: `hipFloatComplex`). + - `Z` (double-precision complex: `hipDoubleComplex`). + - `hipsolver[SDCZ]potrf_bufferSize` obtains the size needed for the working space for the `hipsolver[SDCZ]potrf` function. + - `hipsolver[SDCZ]potrs` solves the system of linear equations defined by $A\times X=B$, where $A$ is a Cholesky-decomposed Hermitian positive semi-definite `n`-by-`n` matrix, $X$ is an `n`-by-`nrhs` matrix, and $B$ is an `n`-by-`nrhs` matrix. + - The `potrf` and `potrs` functions require the specification of a `hipsolverFillMode_t`, which indicates which triangular part of the matrix is processed and replaced by the functions. The legal values are `HIPSOLVER_FILL_MODE_LOWER` and `HIPSOLVER_FILL_MODE_UPPER`. + - The `potrf` and `potrs` functions also require the specification of the _leading dimension_ of all matrices. The leading dimension specifies the number of elements between the beginnings of successive matrix vectors. In other fields, this may be referred to as the _stride_. This concept allows the matrix used in the `potrf` and `potrs` functions to be a sub-matrix of a larger one. Since hipSOLVER matrices are stored in column-major order, the leading dimension must be greater than or equal to the number of rows of the matrix. 
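The decompose-then-solve sequence described above can be condensed into a minimal sketch. This is not the example's actual source: `cholesky_solve`, the buffer names and the leading dimensions are illustrative assumptions, and every returned status should be checked in real code:

```cpp
#include <hipsolver/hipsolver.h> // <hipsolver.h> on older ROCm releases
#include <hip/hip_runtime.h>
#include <algorithm>

// Sketch only: d_A is assumed to be an n-by-n Hermitian positive-definite
// matrix and d_B an n-by-nrhs right-hand side, both already in device memory.
void cholesky_solve(double* d_A, double* d_B, int n, int nrhs)
{
    hipsolverHandle_t handle;
    hipsolverCreate(&handle);

    const hipsolverFillMode_t uplo = HIPSOLVER_FILL_MODE_LOWER;

    // Query both working space sizes and allocate one buffer that fits both.
    int lwork_potrf{}, lwork_potrs{};
    hipsolverDpotrf_bufferSize(handle, uplo, n, d_A, n, &lwork_potrf);
    hipsolverDpotrs_bufferSize(handle, uplo, n, nrhs, d_A, n, d_B, n, &lwork_potrs);
    const int lwork = std::max(lwork_potrf, lwork_potrs);

    double* d_work{};
    int*    d_info{};
    hipMalloc(&d_work, sizeof(double) * lwork);
    hipMalloc(&d_info, sizeof(int));

    // Factor A = L * L^H in place; the value pointed to by d_info is 0 on success.
    hipsolverDpotrf(handle, uplo, n, d_A, n, d_work, lwork, d_info);

    // Solve A * X = B using the factor; B is overwritten with X.
    hipsolverDpotrs(handle, uplo, n, nrhs, d_A, n, d_B, n, d_work, lwork, d_info);

    hipFree(d_info);
    hipFree(d_work);
    hipsolverDestroy(handle);
}
```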
## Used API surface + ### hipSOLVER + - `HIPSOLVER_FILL_MODE_LOWER` - `hipsolverCreate` - `hipsolverDestroy` @@ -41,6 +52,7 @@ This example illustrates the functionality to perform Cholesky decomposition, `p - `hipsolverHandle_t` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/hipSOLVER/syevd/README.md b/Libraries/hipSOLVER/syevd/README.md index 202f1bb3..5305c54f 100644 --- a/Libraries/hipSOLVER/syevd/README.md +++ b/Libraries/hipSOLVER/syevd/README.md @@ -1,6 +1,7 @@ # hipSOLVER Symmetric Eigenvalue calculation (divide and conquer algorithm) ## Description + This example illustrates how to calculate the [eigenvalues](https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors) of a [symmetric](https://en.wikipedia.org/wiki/Symmetric_matrix) 3x3 matrix using hipSOLVER. The eigenvalues of a matrix are defined as such: @@ -8,11 +9,13 @@ The eigenvalues of a matrix are defined as such: $Av_i = \lambda_i v_i$ where + - $A\in\mathbb{R}^{3\times3}$ symmetric matrix, - $\lambda_i$ for $i\in\{1, 2, 3\}$ eigenvalues (in ascending order), - $v_i\in\mathbb{R}^3$ eigenvectors corresponding to the $i$-th eigenvalue. ### Application flow + 1. Instantiate a vector containing $A$'s 9 elements. 2. Allocate device memory and copy $A$'s elements to the device. 3. Allocate device memory for the outputs of the hipSOLVER function, namely for the calculated eigenvalue vector $W=[\lambda_1, \lambda_2, \lambda_3]$, and the returned `info` value. @@ -22,51 +25,64 @@ where 7. Copy the resulting eigenvalues vector to the host. Print their values and check if their values match the expected. 8. Free all allocated resources. - ## Key APIs and Concepts + ### hipSOLVER + - hipSOLVER is initialized by calling `hipsolverCreate(hipsolverHandle_t*)` and it is terminated by calling `hipsolverDestroy(hipsolverHandle_t)`. -- `hipsolver[SD]syevd` computes the eigenvalues of an $n \times n$ matrix $A$. The correct function signature should be chosen based on the datatype of the input matrix:: - - `S` (single-precision real: `float`) - - `D` (double-precision real: `double`) - - A complex version of this function is also available under the name `hipsolver[CZ]heevd`. It accepts the same parameters as `hipsolver[SD]syevd`, except that the correct function signature should be chosen based on the following data types: - - `C` (single-precision complex: `hipFloatComplex`). - - `Z` (double-precision complex: `hipDoubleComplex`). - - In this example, a double-precision real input matrix is used, in which case the function accepts the following parameters: - - `hipsolverHandle_t handle` - - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. The following values are accepted: - - `HIPSOLVER_EIG_MODE_NOVECTOR`: Calculate the eigenvalues only. - - `HIPSOLVER_EIG_MODE_VECTOR`: Calculate both the eigenvalues and the eigenvectors. The eigenvectors are calculated by a divide and conquer algorithm and are written to the memory location specified by `*A`. - - `hipSolverFillMode_t uplo`: Specifies whether the upper or lower triangle of the symmetric matrix is stored. The following values are accepted: - - `HIPSOLVER_FILL_MODE_UPPER`: The provided `*A` pointer points to the upper triangle matrix data. - - `HIPSOLVER_FILL_MODE_LOWER`: The provided `*A` pointer points to the lower triangle matrix data. - - `int n`: Number of rows and columns of $A$. - - `double *A`: Pointer to matrix $A$ in device memory. - - `int lda`: Leading dimension of matrix $A$. 
- - `double *D`: Pointer to array $W$, where the resulting eigenvalues are written.
- - `double *work`: Pointer to working space in device memory.
- - `int lwork`: Size of the working space.
- - `int *devInfo`: Convergence result of the function in device memory. If 0, the algorithm converged, if greater than 0 then `devInfo` elements of the intermediate tridiagonal matrix did not converge to 0. Also, for CUDA backend, if `devInfo = -i` for $0 < i \leq n$, then the $i^{th}$ parameter is wrong (not counting the handle).
- - Return type: `hipsolverStatus_t`.
+
+- `hipsolver[SD]syevd` computes the eigenvalues of an $n \times n$ matrix $A$. The correct function signature should be chosen based on the datatype of the input matrix:
+
+  - `S` (single-precision real: `float`)
+  - `D` (double-precision real: `double`)
+
+  A complex version of this function is also available under the name `hipsolver[CZ]heevd`. It accepts the same parameters as `hipsolver[SD]syevd`, except that the correct function signature should be chosen based on the following data types:
+
+  - `C` (single-precision complex: `hipFloatComplex`).
+  - `Z` (double-precision complex: `hipDoubleComplex`).
+
+  In this example, a double-precision real input matrix is used, in which case the function accepts the following parameters:
+
+  - `hipsolverHandle_t handle`
+  - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. The following values are accepted:
+
+    - `HIPSOLVER_EIG_MODE_NOVECTOR`: Calculate the eigenvalues only.
+    - `HIPSOLVER_EIG_MODE_VECTOR`: Calculate both the eigenvalues and the eigenvectors. The eigenvectors are calculated by a divide and conquer algorithm and are written to the memory location specified by `*A`.
+
+  - `hipSolverFillMode_t uplo`: Specifies whether the upper or lower triangle of the symmetric matrix is stored. The following values are accepted:
+
+    - `HIPSOLVER_FILL_MODE_UPPER`: The provided `*A` pointer points to the upper triangle matrix data.
+    - `HIPSOLVER_FILL_MODE_LOWER`: The provided `*A` pointer points to the lower triangle matrix data.
+
+  - `int n`: Number of rows and columns of $A$.
+  - `double *A`: Pointer to matrix $A$ in device memory.
+  - `int lda`: Leading dimension of matrix $A$.
+  - `double *D`: Pointer to array $W$, where the resulting eigenvalues are written.
+  - `double *work`: Pointer to working space in device memory.
+  - `int lwork`: Size of the working space.
+  - `int *devInfo`: Convergence result of the function in device memory. If 0, the algorithm converged, if greater than 0 then `devInfo` elements of the intermediate tridiagonal matrix did not converge to 0. Also, for CUDA backend, if `devInfo = -i` for $0 < i \leq n$, then the $i^{th}$ parameter is wrong (not counting the handle).
+
+  Return type: `hipsolverStatus_t`.
+
- `hipsolver[SD]syevd_bufferSize` allows obtaining the size (in bytes) needed for the working space for the `hipsolver[SD]syevd` function. The character matched in `[SD]` coincides with the one in `hipsolver[SD]syevd`.
- This function accepts the following input parameters:
- - `hipsolverHandle_t handle`
- - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues.
- - `hipSolverFillMode_t uplo`: Specifies whether the upper or lower triangle of the symmetric matrix is stored.
- - `int n`: Number of rows and columns of $A$.
- - `double *A`: Pointer to matrix $A$ in device memory.
- - `int lda`: Leading dimension of matrix $A$.
- - `double *D`: Pointer to array $W$ in device memory, where the resulting eigenvalues are written.
- - `int *lwork`: The required buffer size is written to this location.
+ This function accepts the following input parameters:
+
+  - `hipsolverHandle_t handle`
+  - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues.
+  - `hipSolverFillMode_t uplo`: Specifies whether the upper or lower triangle of the symmetric matrix is stored.
+  - `int n`: Number of rows and columns of $A$.
+  - `double *A`: Pointer to matrix $A$ in device memory.
+  - `int lda`: Leading dimension of matrix $A$.
+  - `double *D`: Pointer to array $W$ in device memory, where the resulting eigenvalues are written.
+  - `int *lwork`: The required buffer size is written to this location.

- Return type: `hipsolverStatus_t`.
+  Return type: `hipsolverStatus_t`.

## Used API surface
+
### hipSOLVER
+
- `hipsolverCreate`
- `hipsolverDsyevd_bufferSize`
- `hipsolverDsyevd`
@@ -75,6 +91,7 @@ where
- `HIPSOLVER_FILL_MODE_UPPER`

### HIP runtime
+
- `hipMalloc`
- `hipMemcpy`
- `hipFree`
diff --git a/Libraries/hipSOLVER/syevdx/README.md b/Libraries/hipSOLVER/syevdx/README.md
index c7137829..2f52a01b 100644
--- a/Libraries/hipSOLVER/syevdx/README.md
+++ b/Libraries/hipSOLVER/syevdx/README.md
@@ -1,6 +1,7 @@
# hipSOLVER Compatibility API Symmetric Eigenvalue Calculation (divide and conquer algorithm)

## Description
+
This example illustrates how to solve the standard symmetric-definite eigenvalue problem for a symmetric matrix $A$ using hipSOLVER's [Compatibility API](https://hipsolver.readthedocs.io/en/rocm-5.4.4/compat_index.html). This API offers wrapper functions for the ones existing in hipSOLVER (and their equivalents in [cuSolverDN](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdn-dense-lapack)) and is intended to be used when porting cuSOLVER applications to hipSOLVER ones. The main advantage of this API is that its functions follow the same method signature format as cuSolverDN's, which makes the port easier.

Given an $n \times n$ symmetric matrix $A$, the said problem consists of solving the following equation:
@@ -8,6 +9,7 @@ Given an $n \times n$ symmetric matrix $A$, the said problem consists of solving

$Ax = \lambda x$.

A solution for this problem is given by a pair $(X, \Lambda)$, where
+
- $X$ is an $n \times n$ orthogonal matrix containing (as columns) the eigenvectors $x_i$ for $i = 0, \dots, n-1$ and
- $\Lambda$ is an $n \times n$ diagonal matrix containing the eigenvalues $\lambda_i$ for $i = 0, \dots, n-1$
@@ -16,6 +18,7 @@ such that

$A x_i = \lambda_i x_i$ for $i = 0, \dots, n-1$.

### Application flow
+
1. Declare and initialize a number of constants for the input matrix.
2. Allocate and initialize the host matrix $A$.
3. Allocate device memory and copy input data from host to device.
@@ -28,86 +31,107 @@ $A x_i = \lambda_i x_i$ for $i = 0, \dots, n-1$.
10. Print validation result.

## Key APIs and Concepts
+
### hipSOLVER
+
- hipSOLVER is initialized by calling `hipsolverDnCreate(hipsolverHandle_t*)` and it is terminated by calling `hipsolverDnDestroy(hipsolverHandle_t)`.
- In this example `hipsolverDnHandle_t` is used instead of `hipsolverHandle_t`. `hipsolverDnHandle_t` is actually a typedef of `hipsolverHandle_t`, so they can be used equivalently.
- `hipsolverDn[SD]syevdx` computes the eigenvalues of an $n \times n$ symmetric matrix $A$.
The correct function signature should be chosen based on the datatype of the input matrix:
- - `S` (single-precision real: `float`)
- - `D` (double-precision real: `double`)
- For single- and double-precision complex values, the function `hipsolverDn[CZ]heevdx(...)` is available in hipSOLVER.
-
- In this example, a double-precision real input matrix is used, in which case the function accepts the following parameters:
- - `hipsolverHandle_t handle`
- - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. The following values are accepted:
- - `HIPSOLVER_EIG_MODE_NOVECTOR`: Calculate the eigenvalues only.
- - `HIPSOLVER_EIG_MODE_VECTOR`: Calculate both the eigenvalues and the eigenvectors. The eigenvectors are calculated by a divide and conquer algorithm and are written to the memory location specified by `*A`.
- - `hipsolverEigRange_t range`: Specifies a range of eigenvalues to be returned. The following values are accepted:
- - `HIPSOLVER_EIG_RANGE_ALL`: The whole spectrum is returned.
- - `HIPSOLVER_EIG_RANGE_V`: Only the eigenvalues in the interval `(vl, vu]` are returned. `vl` $<$ `vu` must be satisfied.
- - `HIPSOLVER_EIG_RANGE_I`: Only the eigenvalues from the `il`-th to the `iu`-th are returned. $1$ $\leq$ `il` $\leq$ `iu` $\leq$ $n$ must be satisfied.
+  - `S` (single-precision real: `float`)
+  - `D` (double-precision real: `double`)
+
+  For single- and double-precision complex values, the function `hipsolverDn[CZ]heevdx(...)` is available in hipSOLVER.
+
+  In this example, a double-precision real input matrix is used, in which case the function accepts the following parameters:
+
+  - `hipsolverHandle_t handle`
+  - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. The following values are accepted:
+
+    - `HIPSOLVER_EIG_MODE_NOVECTOR`: Calculate the eigenvalues only.
+    - `HIPSOLVER_EIG_MODE_VECTOR`: Calculate both the eigenvalues and the eigenvectors. The eigenvectors are calculated by a divide and conquer algorithm and are written to the memory location specified by `*A`.
+
+  - `hipsolverEigRange_t range`: Specifies a range of eigenvalues to be returned. The following values are accepted:
+
+    - `HIPSOLVER_EIG_RANGE_ALL`: The whole spectrum is returned.
+    - `HIPSOLVER_EIG_RANGE_V`: Only the eigenvalues in the interval `(vl, vu]` are returned. `vl` $<$ `vu` must be satisfied.
+    - `HIPSOLVER_EIG_RANGE_I`: Only the eigenvalues from the `il`-th to the `iu`-th are returned. $1$ $\leq$ `il` $\leq$ `iu` $\leq$ $n$ must be satisfied.
+
  - `hipSolverFillMode_t uplo`: Specifies whether the upper or lower triangle of the symmetric matrix is stored. The following values are accepted:
-
- - `HIPSOLVER_FILL_MODE_UPPER`: The provided `*A` pointer points to the upper triangle matrix data.
- - `HIPSOLVER_FILL_MODE_LOWER`: The provided `*A` pointer points to the lower triangle matrix data.
- - `int n`: Number of rows and columns of $A$.
- - `double *A`: Pointer to matrix $A$ in device memory.
- - `int lda`: Leading dimension of matrix $A$.
- - `double vl`: Lower bound of the interval to be searched for eigenvalues if `range` = `HIPSOLVER_EIG_RANGE_V`.
- - `double vu`: Upper bound of the interval to be searched for eigenvalues if `range` = `HIPSOLVER_EIG_RANGE_V`.
- - `int il`: Smallest index of the eigenvalues to be returned if `range` = `HIPSOLVER_EIG_RANGE_I`.
- - `int iu`: Largest index of the eigenvalues to be returned if `range` = `HIPSOLVER_EIG_RANGE_I`.
- - `int *nev`: Number of eigenvalues returned.
- - `double *W`: Pointer to array $W$ in device memory, where the resulting eigenvalues are written.
- - `double *work`: Pointer to working space in device memory.
- - `int lwork`: Size of the working space.
- - `int *devInfo`: Convergence result of the function in device memory.
- - If 0, the algorithm converged.
- - If greater than 0 (`devInfo = i` for $0 < i \leq n$), then `devInfo` eigenvectors did not converge.
- - For CUDA backend, if less than 0 (`devInfo = -i` for $0 < i \leq n$) then the $i^{th}$ parameter is wrong (not counting the handle).
- - Return type: `hipsolverStatus_t`.
+
+  - `HIPSOLVER_FILL_MODE_UPPER`: The provided `*A` pointer points to the upper triangle matrix data.
+  - `HIPSOLVER_FILL_MODE_LOWER`: The provided `*A` pointer points to the lower triangle matrix data.
+
+  - `int n`: Number of rows and columns of $A$.
+  - `double *A`: Pointer to matrix $A$ in device memory.
+  - `int lda`: Leading dimension of matrix $A$.
+  - `double vl`: Lower bound of the interval to be searched for eigenvalues if `range` = `HIPSOLVER_EIG_RANGE_V`.
+  - `double vu`: Upper bound of the interval to be searched for eigenvalues if `range` = `HIPSOLVER_EIG_RANGE_V`.
+  - `int il`: Smallest index of the eigenvalues to be returned if `range` = `HIPSOLVER_EIG_RANGE_I`.
+  - `int iu`: Largest index of the eigenvalues to be returned if `range` = `HIPSOLVER_EIG_RANGE_I`.
+  - `int *nev`: Number of eigenvalues returned.
+  - `double *W`: Pointer to array $W$ in device memory, where the resulting eigenvalues are written.
+  - `double *work`: Pointer to working space in device memory.
+  - `int lwork`: Size of the working space.
+  - `int *devInfo`: Convergence result of the function in device memory.
+
+    - If 0, the algorithm converged.
+    - If greater than 0 (`devInfo = i` for $0 < i \leq n$), then `devInfo` eigenvectors did not converge.
+    - For CUDA backend, if less than 0 (`devInfo = -i` for $0 < i \leq n$) then the $i^{th}$ parameter is wrong (not counting the handle).
+
+  Return type: `hipsolverStatus_t`.
+
- `hipsolverDn[SD]syevdx` internally calls `cusolverDn[SD]syevdx` for CUDA backend and rocSOLVER's internal `syevx` function (not the one from the public API) for ROCm backend, as no `hipsolver[SD]syevdx` function exists yet in the regular hipSOLVER API.
- `hipsolverDn[SD]syevdx_bufferSize` allows obtaining the size (in bytes) needed for the working space for the `hipsolverDn[SD]syevdx` function. The character matched in `[SD]` coincides with the one in `hipsolverDn[SD]syevdx`.
- This function accepts the following input parameters:
- - `hipsolverHandle_t handle`
- - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues.
- - `hipsolverEigRange_t range`: Specifies a range of eigenvalues to be returned.
- - `hipSolverFillMode_t uplo`: Specifies whether the upper or lower triangle of the symmetric matrix is stored.
- - `int n`: Number of rows and columns of $A$.
- - `double *A`: Pointer to matrix $A$ in device memory.
- - `int lda`: Leading dimension of matrix $A$.
- - `double vl`: Lower bound of the interval to be searched for eigenvalues if `range` = `HIPSOLVER_EIG_RANGE_V`.
- - `double vu`: Upper bound of the interval to be searched for eigenvalues if `range` = `HIPSOLVER_EIG_RANGE_V`.
- - `int il`: Smallest index of the eigenvalues to be returned if `range` = `HIPSOLVER_EIG_RANGE_I`.
- - `int iu`: Largest index of the eigenvalues to be returned if `range` = `HIPSOLVER_EIG_RANGE_I`. - - `int *nev`: Number of eigenvalues returned. - - `double *W`: Pointer to array $W$ in device memory, where the resulting eigenvalues are written. - - `int *lwork`: The required buffer size is written to this location. - - Return type: `hipsolverStatus_t`. + This function accepts the following input parameters: + + - `hipsolverHandle_t handle` + - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. + - `hipsolverEigRange_t range`: Specifies a range of eigenvalues to be returned. + - `hipSolverFillMode_t uplo`: Specifies whether the upper or lower triangle of the symmetric matrix is stored. + - `int n`: Number of rows and columns of $A$. + - `double *A`: Pointer to matrix $A$ in device memory. + - `int lda`: Leading dimension of matrix $A$. + - `double vl`: Lower bound of the interval to be searched for eigenvalues if `range` = `HIPSOLVER_EIG_RANGE_V`. + - `double vu`: Upper bound of the interval to be searched for eigenvalues if `range` = `HIPSOLVER_EIG_RANGE_V`. + - `int il`: Smallest index of the eigenvalues to be returned if `range` = `HIPSOLVER_EIG_RANGE_I`. + - `int iu`: Largest index of the eigenvalues to be returned if `range` = `HIPSOLVER_EIG_RANGE_I`. + - `int *nev`: Number of eigenvalues returned. + - `double *W`: Pointer to array $W$ in device memory, where the resulting eigenvalues are written. + - `int *lwork`: The required buffer size is written to this location. + + Return type: `hipsolverStatus_t`. ### hipBLAS + - For validating the solution we have used the hipBLAS functions `hipblasDdgmm` and `hipblasDgemm`. `hipblasDgemm` computes a general scaled matrix multiplication $\left(C = \alpha \cdot A \cdot B + \beta \cdot C\right)$ and is showcased (strided-batched and with single-precision real type) in the [gemm_strided_batched example](/Libraries/hipBLAS/gemm_strided_batched/). - - The `hipblas[SDCZ]dgmm` function performs a matrix--matrix operation between a diagonal matrix and a general $m \times n$ matrix. The order of the multiplication can be determined using a `hipblasSideMode_t` type parameter: - - `HIPBLAS_SIDE_RIGHT`: the operation performed is $C = A \cdot diag(x)$. This is the one used in the example for computing $X \cdot \Lambda$. - - `HIPBLAS_SIDE_LEFT`: the operation performed is $C = diag(x) \cdot A$. - The correct function signature should be chosen based on the datatype of the input matrices: - - `S` (single-precision real: `float`) - - `D` (double-precision real: `double`) - - `C` (single-precision complex: `hipblasComplex`) - - `Z` (double-precision complex: `hipblasDoubleComplex`). + - The `hipblas[SDCZ]dgmm` function performs a matrix--matrix operation between a diagonal matrix and a general $m \times n$ matrix. The order of the multiplication can be determined using a `hipblasSideMode_t` type parameter: + + - `HIPBLAS_SIDE_RIGHT`: the operation performed is $C = A \cdot diag(x)$. This is the one used in the example for computing $X \cdot \Lambda$. + - `HIPBLAS_SIDE_LEFT`: the operation performed is $C = diag(x) \cdot A$. + + The correct function signature should be chosen based on the datatype of the input matrices: + + - `S` (single-precision real: `float`) + - `D` (double-precision real: `double`) + - `C` (single-precision complex: `hipblasComplex`) + - `Z` (double-precision complex: `hipblasDoubleComplex`). + + Return type: `hipblasStatus_t`. - Return type: `hipblasStatus_t`. 
- The `hipblasPointerMode_t` type controls whether scalar parameters must be allocated on the host (`HIPBLAS_POINTER_MODE_HOST`) or on the device (`HIPBLAS_POINTER_MODE_DEVICE`). It is set by using `hipblasSetPointerMode`.

## Used API surface
+
### hipSOLVER
+
- `HIPSOLVER_EIG_MODE_VECTOR`
- `HIPSOLVER_FILL_MODE_UPPER`

### hipSOLVER Compatibility API
+
- `HIPSOLVER_EIG_RANGE_I`
- `hipsolverDnCreate`
- `hipsolverDnDestroy`
@@ -116,6 +140,7 @@ $A x_i = \lambda_i x_i$ for $i = 0, \dots, n-1$.
- `hipsolverDnHandle_t`

### hipBLAS
+
- `HIPBLAS_OP_N`
- `HIPBLAS_POINTER_MODE_HOST`
- `HIPBLAS_SIDE_RIGHT`
@@ -127,6 +152,7 @@ $A x_i = \lambda_i x_i$ for $i = 0, \dots, n-1$.
- `hipblasSetPointerMode`

### HIP runtime
+
- `hipFree`
- `hipMalloc`
- `hipMemcpy`
diff --git a/Libraries/hipSOLVER/syevj/README.md b/Libraries/hipSOLVER/syevj/README.md
index 1024972c..511c9f03 100644
--- a/Libraries/hipSOLVER/syevj/README.md
+++ b/Libraries/hipSOLVER/syevj/README.md
@@ -1,6 +1,7 @@
# hipSOLVER Symmetric Eigenvalue via Generalized Jacobi Example

## Description
+
This example illustrates how to compute the eigenvalues $W$ and eigenvectors $V$ from a symmetric $n \times n$ real matrix $A$ using the Jacobi method.

For computing eigenvalues and eigenvectors of Hermitian (complex) matrices, refer to `hipsolver[CZ]heevj`.

The results are verified by substituting them into the equation we wanted to solve, $A \times V = V \times W$, and checking the error.

### Command line interface
+
The application has an optional argument:
+
- `-n ` with size of the $n \times n$ matrix $A$. The default value is `3`.

## Application flow
+
1. Parse command line arguments for dimensions of the input matrix.
2. Declare the host side inputs and outputs.
3. Initialize a random symmetric $n \times n$ input matrix.
@@ -29,14 +33,19 @@ The application has an optional argument:
12. Free the memory allocations on device.

## Key APIs and Concepts
+
### hipSOLVER
+
- hipSOLVER (`hipsolverHandle_t`) gets initialized by `hipsolverCreate` and destroyed by `hipsolverDestroy`.
- `hipsolverEigMode_t`: specifies whether only the eigenvalues or also the eigenvectors should be computed. Passed to `hipsolverDsyevj` as `jobz`.
+
  - `HIPSOLVER_EIG_MODE_VECTOR`: compute the eigenvalues and eigenvectors.
  - `HIPSOLVER_EIG_MODE_NOVECTOR`: only compute the eigenvalues.
+
- `hipsolverFillMode_t`: specifies which part of $A$ to use.
  - `HIPSOLVER_FILL_MODE_LOWER`: data is stored in the lower triangle of $A$.
  - `HIPSOLVER_FILL_MODE_UPPER`: data is stored in the upper triangle of $A$.
+
- `hipsolverCreateSyevjInfo`: initializes a structure for the parameters and results for calling `syevj`.
- `hipsolverDestroySyevjInfo`: destroys the structure for the parameters and results for calling `syevj`.
- `hipsolverXsyevjSetMaxSweeps`: configures the maximum number of sweeps
- `hipsolverXsyevjSetTolerance`: configures the tolerance
- `hipsolverXsyevjSetSortEig`: configures whether to sort the results or not
- `hipsolver[SD]syevj_bufferSize` computes the required buffer size `lwork` from a given configuration.
- `hipsolver[SD]syevj` computes the eigenvalues and, optionally, the eigenvectors.
+
  - There are two different function signatures depending on the type of the input matrix:
+
    - `S` single-precision real (`float`)
    - `D` double-precision real (`double`)

    For single- and double-precision complex values, the function `hipsolver[CZ]heevj(...)` is available in hipSOLVER.

    For example, `hipsolverDsyevj(...)` works on `double`s.
For the complex datatypes see `hipsolver[CZ]heevj`. + - `hipsolverHandle_t handle`: hipSOLVER handle, see `hipsolverCreate` - `hipsolverEigMode_t jobz`: eigenvector output mode, see `hipsolverEigMode_t`. - `hipsolverFillMode_t uplo`: fill mode of $A$, see `hipsolverFillMode_t`. @@ -66,7 +78,9 @@ The application has an optional argument: - `hipsolverXsyevjGetResidual`: gets the residual of `syevj`. ## Used API surface + ### hipSOLVER + - `hipsolverCreate` - `hipsolverDestroy` - `hipsolverCreateSyevjInfo` @@ -85,8 +99,8 @@ The application has an optional argument: - `HIPSOLVER_EIG_MODE_VECTOR` - `HIPSOLVER_FILL_MODE_LOWER` - ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/hipSOLVER/syevj_batched/README.md b/Libraries/hipSOLVER/syevj_batched/README.md index 68c083aa..7cb7b148 100644 --- a/Libraries/hipSOLVER/syevj_batched/README.md +++ b/Libraries/hipSOLVER/syevj_batched/README.md @@ -1,6 +1,7 @@ # hipSOLVER Symmetric Eigenvalue via Generalized Jacobi Batched Example ## Description + This example illustrates how to solve the standard symmetric-definite eigenvalue problem for a batch $A$ of $m$ symmetric matrices $A_i$ using hipSOLVER. That is, showcases how to compute the eigenvalues and eigenvectors of a batch of symmetric matrices. The eigenvectors are computed using the Jacobi method. Given a batch of $m$ symmetric matrices $A_i$ of dimension $n$, the said problem consists on solving the following equation: @@ -10,6 +11,7 @@ $A_ix = \lambda x$ for each $0 \leq i \leq m-1$. A solution for this problem is given by $m$ pairs $(X_i, \Lambda_i)$, where + - $X_i$ is an $n \times n$ orthonormal matrix containing (as columns) the eigenvectors $x_{i_j}$ for $j = 0, \dots, n-1$ and - $\Lambda_i$ is an $n \times n$ diagonal matrix containing the eigenvalues $\lambda_{i_j}$ for $j = 0, \dots, n-1$ @@ -26,12 +28,15 @@ $A_i X_i - X_i \Lambda_i = 0$ for each $0 \leq i \leq m - 1$. ### Command line interface + The application provides the following command line arguments: + - `-h` displays information about the available parameters and their default values. - `-n, --n ` sets the size of the $n \times n$ input matrices in the batch. The default value is `3`. - `-b, --batch_count ` sets `batch_count` as the number of matrices in the batch. The default value is `2`. ## Application flow + 1. Parse command line arguments. 2. Allocate and initialize the host side inputs. 3. Allocate device memory and copy input data from host. @@ -45,14 +50,20 @@ The application provides the following command line arguments: 11. Clean up device allocations and print validation result. ## Key APIs and Concepts + ### hipSOLVER + - hipSOLVER is initialized by calling `hipsolverCreate(hipsolverHandle_t*)` and it is terminated by calling `hipsolverDestroy(hipsolverHandle_t)`. - `hipsolverEigMode_t`: specifies whether only the eigenvalues or also the eigenvectors should be computed. The following values are accepted: + - `HIPSOLVER_EIG_MODE_VECTOR`: compute the eigenvalues and eigenvectors. - `HIPSOLVER_EIG_MODE_NOVECTOR`: only compute the eigenvalues. + - `hipsolverFillMode_t`: specifies whether the upper or lower triangle of each symmetric matrix is stored. The following values are accepted: + - `HIPSOLVER_FILL_MODE_LOWER`: data is stored in the lower triangle of the matrix in the batch. - `HIPSOLVER_FILL_MODE_UPPER`: data is stored in the upper triangle of the matrix in the batch. 
+ - `hipsolverCreateSyevjInfo`: initializes a structure for the input parameters and output information for calling `syevjBatched`. - `hipsolverDestroySyevjInfo`: destroys the structure for the input parameters and output information for calling `syevjBatched`. - `hipsolverXsyevjSetMaxSweeps`: configures the maximum amounts of sweeps. @@ -60,12 +71,14 @@ The application provides the following command line arguments: - `hipsolverXsyevjSetSortEig` : configures whether to sort the eigenvalues (and eigenvectors, if applicable) or not. By default they are always sorted increasingly. - `hipsolver[SD]syevjBatched_bufferSize`: computes the required workspace size `lwork` on the device for a given configuration. - `hipsolver[SD]syevjBatched`: computes the eigenvalues of a batch $A$ of $n \times n$ symmetric matrices $A_i$. The correct function signature should be chosen based on the datatype of the input matrices: + - `S` single-precision real (`float`) - `D` double-precision real (`double`) For single- and double-precision complex values, the function `hipsolver[CZ]heevjBatched(...)` is available in hipSOLVER. In this example, a double-precision real input matrix is used, in which case the function accepts the following parameters: + - `hipsolverHandle_t handle` - `hipsolverEigMode_t jobz` - `hipsolverFillMode_t uplo` @@ -76,30 +89,38 @@ The application provides the following command line arguments: - `double* work`: pointer to working space in device memory. - `int lwork`: size of the working space. - `int* devInfo`: pointer to where the convergence result of the function is written to in device memory. - - If 0, the algorithm converged. - - If greater than 0 (`devInfo = i` for $1 \leq i \leq n$), then `devInfo` eigenvectors did not converge. - - For CUDA backend, if lesser than 0 (`devInfo = -i` for $1 \leq i \leq n$) then the the $i^{th}$ parameter is wrong (not counting the handle). + + - If 0, the algorithm converged. + - If greater than 0 (`devInfo = i` for $1 \leq i \leq n$), then `devInfo` eigenvectors did not converge. + - For CUDA backend, if lesser than 0 (`devInfo = -i` for $1 \leq i \leq n$) then the the $i^{th}$ parameter is wrong (not counting the handle). + - `syevjInfo_t params`: the structure for the input parameters and output information of `syevjBatched`. + - `hipsolverXsyevjGetSweeps`: gets the amount of executed sweeps of `syevjBatched`. Currently it's not supported for the batched version and a `HIPSOLVER_STATUS_NOT_SUPPORTED` error is emitted if this function is invoked. - `hipsolverXsyevjGetResidual`: gets the residual of `syevjBatched`. Currently it's not supported for the batched version and a `HIPSOLVER_STATUS_NOT_SUPPORTED` error is emitted if this function is invoked. ### hipBLAS + - For validating the solution we have used the hipBLAS functions `hipblasDdgmm` and `hipblasDgemm`. `hipblasDgemm` computes a general scaled matrix multiplication $\left(C = \alpha \cdot A \cdot B + \beta \cdot C\right)$ and is showcased (strided-batched and with single-precision real type) in the [gemm_strided_batched example](/Libraries/hipBLAS/gemm_strided_batched/). - - The `hipblas[SDCZ]dgmm` function performs a matrix--matrix operation between a diagonal matrix and a general $m \times n$ matrix. The order of the multiplication can be determined using a `hipblasSideMode_t` type parameter: - - `HIPBLAS_SIDE_RIGHT`: the operation performed is $C = A \cdot diag(x)$. This is the one used in the example for computing $X \cdot \Lambda$. 
- - `HIPBLAS_SIDE_LEFT`: the operation performed is $C = diag(x) \cdot A$. - The correct function signature should be chosen based on the datatype of the input matrices: - - `S` (single-precision real: `float`) - - `D` (double-precision real: `double`) - - `C` (single-precision complex: `hipblasComplex`) - - `Z` (double-precision complex: `hipblasDoubleComplex`). + - The `hipblas[SDCZ]dgmm` function performs a matrix--matrix operation between a diagonal matrix and a general $m \times n$ matrix. The order of the multiplication can be determined using a `hipblasSideMode_t` type parameter: + + - `HIPBLAS_SIDE_RIGHT`: the operation performed is $C = A \cdot diag(x)$. This is the one used in the example for computing $X \cdot \Lambda$. + - `HIPBLAS_SIDE_LEFT`: the operation performed is $C = diag(x) \cdot A$. - Return type: `hipblasStatus_t`. + The correct function signature should be chosen based on the datatype of the input matrices: + - `S` (single-precision real: `float`) + - `D` (double-precision real: `double`) + - `C` (single-precision complex: `hipblasComplex`) + - `Z` (double-precision complex: `hipblasDoubleComplex`). + + Return type: `hipblasStatus_t`. - The `hipblasPointerMode_t` type controls whether scalar parameters must be allocated on the host (`HIPBLAS_POINTER_MODE_HOST`) or on the device (`HIPBLAS_POINTER_MODE_DEVICE`). It is set by using `hipblasSetPointerMode`. ## Used API surface + ### hipSOLVER + - `HIPSOLVER_EIG_MODE_VECTOR` - `HIPSOLVER_FILL_MODE_LOWER` - `hipsolverCreate` @@ -117,6 +138,7 @@ The application provides the following command line arguments: - `hipsolverXsyevjSetTolerance` ### hipBLAS + - `HIPBLAS_OP_N` - `HIPBLAS_POINTER_MODE_HOST` - `HIPBLAS_SIDE_RIGHT` @@ -128,6 +150,7 @@ The application provides the following command line arguments: - `hipblasSetPointerMode` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/hipSOLVER/sygvd/README.md b/Libraries/hipSOLVER/sygvd/README.md index 82e33289..bf6b6eab 100644 --- a/Libraries/hipSOLVER/sygvd/README.md +++ b/Libraries/hipSOLVER/sygvd/README.md @@ -1,9 +1,11 @@ # hipSOLVER Generalized Symmetric Eigenvalue Problem Solver Example ## Description + This example illustrates how to solve the generalized symmetric-definite eigenvalue problem for a given pair of matrices $(A,B)$ using hipSOLVER. Given a pair $(A,B)$ such that + - $A,B \in \mathcal{M}_n(\mathbb{R})$ are symmetric matrices and - $B$ is [positive definite](https://en.wikipedia.org/wiki/Definite_matrix), @@ -12,6 +14,7 @@ the said problem consists on solving the following equation: $(A - \lambda B)x = 0$. Such a solution is given by a pair $(X, \Lambda)$, where + - $X$ is an $n \times n$ orthogonal matrix containing (as columns) the eigenvectors $x_i$ for $i = 0, \dots, n-1$ and - $\Lambda$ is an $n \times n$ diagonal matrix containing the eigenvalues $\lambda_i$ for $i = 0, \dots, n-1$ @@ -20,6 +23,7 @@ such that $(A - \lambda_i B)x_i = 0$ for $i = 0, \dots, n-1$. ### Application flow + 1. Declare and initialize a number of constants for the input matrices. 2. Allocate and initialize the host matrices. 3. Allocate device memory and copy input data from host to device. @@ -31,77 +35,96 @@ $(A - \lambda_i B)x_i = 0$ for $i = 0, \dots, n-1$. 9. Free device memory and the handles. 10. Print validation result. - ## Key APIs and Concepts + ### hipSOLVER + - hipSOLVER is initialized by calling `hipsolverCreate(hipsolverHandle_t*)` and is terminated by calling `hipsolverDestroy(hipsolverHandle_t)`. 
- `hipsolver[SD]sygvd` computes the eigenvalues and optionally the eigenvectors of an $n \times n$ symmetric pair $(A, B)$, where $B$ is also positive definite. The correct function signature should be chosen based on the datatype of the input pair: - - `S` (single-precision real: `float`) - - `D` (double-precision real: `double`). - - A complex version of this function is also available under the name `hipsolver[CZ]hegvd`. It accepts the same parameters as `hipsolver[SD]sygvd`, except that the correct function signature should be chosen based on the following data types: - - `C` (single-precision complex: `hipFloatComplex`). - - `Z` (double-precision complex: `hipDoubleComplex`). - - In this example, a double-precision real input pair is used, in which case the function accepts the following parameters: - - `hipsolverHandle_t handle` - - `hipsolverEigType_t itype`: Specifies the problem type to be solved: - - `HIPSOLVER_EIG_TYPE_1`: $A \cdot X = B \cdot X \cdot \Lambda$ - - `HIPSOLVER_EIG_TYPE_2`: $A \cdot B \cdot X = X \cdot \Lambda$ - - `HIPSOLVER_EIG_TYPE_3`: $B \cdot A \cdot X = X \cdot \Lambda$ - - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. The following values are accepted: - - `HIPSOLVER_EIG_MODE_NOVECTOR`: calculate the eigenvalues only. - - `HIPSOLVER_EIG_MODE_VECTOR`: calculate both the eigenvalues and the eigenvectors. The eigenvectors are calculated using a divide-and-conquer algorithm and are overwritten to the device memory location pointed by `*A`. - - `hipSolverFillMode_t uplo`: Specifies which part of input matrices $A$ and $B$ are stored. The following values are accepted: - - `HIPSOLVER_FILL_MODE_UPPER`: The provided `*A` and `*B` pointers point to the upper triangle matrix data. - - `HIPSOLVER_FILL_MODE_LOWER`: The provided `*A` and `*B` pointers point to the lower triangle matrix data. - - `int n`: Dimension of matrices $A$ and $B$. - - `double *A`: Pointer to matrix $A$ in device memory. - - `int lda`: Leading dimension of matrix $A$. - - `double *B`: Pointer to matrix $B$ in device memory. - - `int ldb`: Leading dimension of matrix $B$. - - `double *W`: Pointer to vector in device memory representing the diagonal of matrix $\Lambda$, where the resulting eigenvalues are written. - - `double *work`: Pointer to working space. - - `int lwork`: Size of the working space, obtained with `hipsolverDsygvd_bufferSize`. - - `int *devInfo`: Convergence result of the function. If 0, the algorithm converged. If greater than 0 and: - - `devInfo = i` for $0 < i \leq n$, then `devInfo` elements of the intermediate tridiagonal matrix did not converge to 0. - - `devInfo = n + i` for $0 < i \leq n$, the leading minor of order $i$ of $B$ is not positive definite. - - Return type: `hipsolverStatus_t`. + + - `S` (single-precision real: `float`) + - `D` (double-precision real: `double`). + + A complex version of this function is also available under the name `hipsolver[CZ]hegvd`. It accepts the same parameters as `hipsolver[SD]sygvd`, except that the correct function signature should be chosen based on the following data types: + + - `C` (single-precision complex: `hipFloatComplex`). + - `Z` (double-precision complex: `hipDoubleComplex`). 
+ + In this example, a double-precision real input pair is used, in which case the function accepts the following parameters: + + - `hipsolverHandle_t handle` + - `hipsolverEigType_t itype`: Specifies the problem type to be solved: + + - `HIPSOLVER_EIG_TYPE_1`: $A \cdot X = B \cdot X \cdot \Lambda$ + - `HIPSOLVER_EIG_TYPE_2`: $A \cdot B \cdot X = X \cdot \Lambda$ + - `HIPSOLVER_EIG_TYPE_3`: $B \cdot A \cdot X = X \cdot \Lambda$ + + - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. The following values are accepted: + + - `HIPSOLVER_EIG_MODE_NOVECTOR`: calculate the eigenvalues only. + - `HIPSOLVER_EIG_MODE_VECTOR`: calculate both the eigenvalues and the eigenvectors. The eigenvectors are calculated using a divide-and-conquer algorithm and are overwritten to the device memory location pointed by `*A`. + + - `hipSolverFillMode_t uplo`: Specifies which part of input matrices $A$ and $B$ are stored. The following values are accepted: + + - `HIPSOLVER_FILL_MODE_UPPER`: The provided `*A` and `*B` pointers point to the upper triangle matrix data. + - `HIPSOLVER_FILL_MODE_LOWER`: The provided `*A` and `*B` pointers point to the lower triangle matrix data. + - `int n`: Dimension of matrices $A$ and $B$. + - `double *A`: Pointer to matrix $A$ in device memory. + - `int lda`: Leading dimension of matrix $A$. + - `double *B`: Pointer to matrix $B$ in device memory. + - `int ldb`: Leading dimension of matrix $B$. + - `double *W`: Pointer to vector in device memory representing the diagonal of matrix $\Lambda$, where the resulting eigenvalues are written. + - `double *work`: Pointer to working space. + - `int lwork`: Size of the working space, obtained with `hipsolverDsygvd_bufferSize`. + - `int *devInfo`: Convergence result of the function. If 0, the algorithm converged. If greater than 0 and: + + - `devInfo = i` for $0 < i \leq n$, then `devInfo` elements of the intermediate tridiagonal matrix did not converge to 0. + - `devInfo = n + i` for $0 < i \leq n$, the leading minor of order $i$ of $B$ is not positive definite. + + Return type: `hipsolverStatus_t`. + - `hipsolver[SD]sygvd_bufferSize` allows to obtain the size (in bytes) needed for the working space of the `hipsolver[SD]sygvd` function. The character matched in `[SD]` coincides with the one in `hipsolver[SD]sygvd`. - This function accepts the following input parameters: - - `hipsolverHandle_t handle` - - `hipsolverEigType_t itype`: Specifies the problem type to be solved. - - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. - - `hipSolverFillMode_t uplo`: Specifies whether the upper or lower triangles of the of the symmetric input matrices $A$ and $B$ are stored. - - `int n`: Simension of matrices $A$ and $B$. - - `double *A`: Pointer to matrix $A$ in device memory. - - `int lda`: Leading dimension of matrix $A$. - - `double *B`: Pointer to matrix $B$ in device memory. - - `int ldb`: Leading dimension of matrix $B$. - - `double *W`: Pointer to vector in device memory representing the diagonal of matrix $\Lambda$, where the resulting eigenvalues are written. - - `int *lwork`: The required buffer size is written to this location. - - Return type: `hipsolverStatus_t`. + This function accepts the following input parameters: + + - `hipsolverHandle_t handle` + - `hipsolverEigType_t itype`: Specifies the problem type to be solved. 
+ - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. + - `hipsolverFillMode_t uplo`: Specifies whether the upper or lower triangles of the symmetric input matrices $A$ and $B$ are stored. + - `int n`: Dimension of matrices $A$ and $B$. + - `double *A`: Pointer to matrix $A$ in device memory. + - `int lda`: Leading dimension of matrix $A$. + - `double *B`: Pointer to matrix $B$ in device memory. + - `int ldb`: Leading dimension of matrix $B$. + - `double *W`: Pointer to vector in device memory representing the diagonal of matrix $\Lambda$, where the resulting eigenvalues are written. + - `int *lwork`: The required buffer size is written to this location. + + Return type: `hipsolverStatus_t`. ### hipBLAS + - For validating the solution, we have used the hipBLAS functions `hipblasDdgmm` and `hipblasDgemm`. `hipblasDgemm` is showcased (strided-batched and with single-precision real type) in the [gemm_strided_batched example](/Libraries/hipBLAS/gemm_strided_batched/). - - The `hipblas[SDCZ]dgmm` function performs a matrix--matrix operation between a diagonal matrix and a general $m \times n$ matrix. The order of the multiplication can be determined using a `hipblasSideMode_t` type parameter: - - `HIPBLAS_SIDE_RIGHT`: the operation performed is $C = A \cdot diag(x)$. This is the one used in the example for computing $X \cdot \Lambda$. - - `HIPBLAS_SIDE_LEFT`: the operation performed is $C = diag(x) \cdot A$. - The correct function signature should be chosen based on the datatype of the input matrices: - - `S` (single-precision real: `float`) - - `D` (double-precision real: `double`) - - `C` (single-precision complex: `hipblasComplex`) - - `Z` (double-precision complex: `hipblasDoubleComplex`). + - The `hipblas[SDCZ]dgmm` function performs a matrix--matrix operation between a diagonal matrix and a general $m \times n$ matrix. The order of the multiplication can be determined using a `hipblasSideMode_t` type parameter: + + - `HIPBLAS_SIDE_RIGHT`: the operation performed is $C = A \cdot diag(x)$. This is the one used in the example for computing $X \cdot \Lambda$. + - `HIPBLAS_SIDE_LEFT`: the operation performed is $C = diag(x) \cdot A$. + + The correct function signature should be chosen based on the datatype of the input matrices: + + - `S` (single-precision real: `float`) + - `D` (double-precision real: `double`) + - `C` (single-precision complex: `hipblasComplex`) + - `Z` (double-precision complex: `hipblasDoubleComplex`). + + Return type: `hipblasStatus_t`. - Return type: `hipblasStatus_t`. - The `hipblasPointerMode_t` type controls whether scalar parameters must be allocated on the host (`HIPBLAS_POINTER_MODE_HOST`) or on the device (`HIPBLAS_POINTER_MODE_DEVICE`). It is set by using `hipblasSetPointerMode`. ## Used API surface + ### hipSOLVER + - `hipsolverCreate` - `hipsolverDsygvd_bufferSize` - `hipsolverDsygvd` @@ -112,6 +135,7 @@ $(A - \lambda_i B)x_i = 0$ for $i = 0, \dots, n-1$. - `HIPSOLVER_FILL_MODE_UPPER` ### hipBLAS + - `hipblasCreate` - `hipblasDestroy` - `hipblasDdgmm` @@ -123,6 +147,7 @@
- `HIPBLAS_SIDE_RIGHT` ### HIP runtime + - `hipMalloc` - `hipMemcpy` - `hipFree` diff --git a/Libraries/hipSOLVER/sygvj/README.md b/Libraries/hipSOLVER/sygvj/README.md index 4345fad1..77dcccc5 100644 --- a/Libraries/hipSOLVER/sygvj/README.md +++ b/Libraries/hipSOLVER/sygvj/README.md @@ -1,6 +1,5 @@ # hipSOLVER Generalized Dense Symmetric Eigenvalue calculation (Jacobi algorithm) - ## Description _[This example currently works only on rocSOLVER backend.](https://github.com/ROCmSoftwarePlatform/hipSOLVER/issues/152)_ @@ -12,6 +11,7 @@ The generalized eigenvalues and eigenvectors of a matrix pair are defined as: $Av_i = \lambda_i Bv_i$ where + - $A,B\in\mathbb{R}^{n\times n}$ are symmetric matrices, - $\lambda_i$ for $i\in\{1, \dots n\}$ are eigenvalues, - $v_i\in\mathbb{R}^n$ are eigenvectors corresponding to the $i$-th eigenvalue (they can be normalized to unit length). @@ -19,6 +19,7 @@ where This choice corresponds to `HIPSOLVER_EIG_TYPE_1` parameter value of the solver function `hipsolverDsygvj`. Two other possibilities include $ABv_i = \lambda_i v_i$ for `HIPSOLVER_EIG_TYPE_2` and $BAv_i = \lambda_i v_i$ for `HIPSOLVER_EIG_TYPE_3`. ### Application flow + 1. Instantiate two vectors of size $n\times n$ for $n=3$ containing $A$'s and $B$'s elements. 2. Allocate device memory and copy $A$'s and $B$'s elements to the device. 3. Allocate device memory for the outputs of the hipSOLVER function, namely for the calculated eigenvalue vector $W=[\lambda_1, \lambda_2, \lambda_3]$, and the returned `info` value. @@ -31,60 +32,68 @@ This choice corresponds to `HIPSOLVER_EIG_TYPE_1` parameter value of the solver 10. Copy residual and executed sweeps number to the host. Print their values. 11. Free all allocated resources. - ## Key APIs and Concepts + ### hipSOLVER + - hipSOLVER is initialized by calling `hipsolverCreate(hipsolverHandle_t*)` and it is terminated by calling `hipsolverDestroy(hipsolverHandle_t)`. -- `hipsolver[SD]sygvj` computes the generalized eigenvalues of an $n \times n$ matrix pair $A$ and $B$. The correct function signature should be chosen based on the datatype of the input matrix:: - - `S` (single-precision real: `float`) - - `D` (double-precision real: `double`) - - For single- and double-precision complex values, the function `hipsolver[CZ]hegvj(...)` is available in hipSOLVER. - - In this example, a double-precision real input matrix pair is used, in which case the function accepts the following parameters: - - `hipsolverHandle_t handle` - - `hipsolverEigType_t itype`: Specifies the type of eigensystem problem, see [above](#description). - - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. The following values are accepted: - - `HIPSOLVER_EIG_MODE_NOVECTOR`: Calculate the eigenvalues only. - - `HIPSOLVER_EIG_MODE_VECTOR`: Calculate both the eigenvalues and the eigenvectors. The eigenvectors are calculated using the Jacobi method and written to the memory location specified by `*A`. - - `hipSolverFillMode_t uplo`: Specifies whether the upper or lower triangle of the symmetric matrix is stored. The following values are accepted: - - `HIPSOLVER_FILL_MODE_UPPER`: The provided `*A` pointer points to the upper triangle matrix data. - - `HIPSOLVER_FILL_MODE_LOWER`: The provided `*A` pointer points to the lower triangle matrix data. - - `int n`: Number of rows and columns of $A$. - - `double *A`: Pointer to matrix $A$ in device memory. - - `int lda`: Leading dimension of matrix $A$. 
- - `double *B`: Pointer to matrix $B$ in device memory. - - `int ldb`: Leading dimension of matrix $B$. - - `double *D`: Pointer to array $W$, where the resulting eigenvalues are written. - - `double *work`: Pointer to working space in device memory. - - `int lwork`: Size of the working space. - - `int *devInfo`: Convergence result of the function in device memory. If 0, the algorithm converged; if greater than 0 and less than or equal to `n`, the `devInfo`-th leading minor of `B` is not positive definite; if equal to `n+1`, convergence was not achieved. Also, for the CUDA backend, if `devInfo = -i` for $0 < i \leq n$, then the $i^{th}$ parameter is wrong (not counting the handle). - - `hipsolverSyevjInfo_t params`: Pointer to the structure with the solver parameters, which should be created by the function `hipsolverCreateSyevjInfo(&params)`. The solver has two parameters: - - Tolerance `tol`, set by the function `hipsolverXsyevjSetTolerance(syevj_params, tol)`; the default value of the tolerance is machine zero. - - Maximal number of sweeps to obtain convergence `max_sweeps`, set by the function `hipsolverXsyevjSetMaxSweeps(syevj_params, max_sweeps)`; the default value is 100. - - Return type: `hipsolverStatus_t`. +- `hipsolver[SD]sygvj` computes the generalized eigenvalues of an $n \times n$ matrix pair $A$ and $B$. The correct function signature should be chosen based on the datatype of the input matrix: + + - `S` (single-precision real: `float`) + - `D` (double-precision real: `double`) + + For single- and double-precision complex values, the function `hipsolver[CZ]hegvj(...)` is available in hipSOLVER. + + In this example, a double-precision real input matrix pair is used, in which case the function accepts the following parameters: + + - `hipsolverHandle_t handle` + - `hipsolverEigType_t itype`: Specifies the type of eigensystem problem, see [above](#description). + - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. The following values are accepted: + - `HIPSOLVER_EIG_MODE_NOVECTOR`: Calculate the eigenvalues only. + - `HIPSOLVER_EIG_MODE_VECTOR`: Calculate both the eigenvalues and the eigenvectors. The eigenvectors are calculated using the Jacobi method and written to the memory location specified by `*A`. + - `hipsolverFillMode_t uplo`: Specifies whether the upper or lower triangle of the symmetric matrix is stored. The following values are accepted: + - `HIPSOLVER_FILL_MODE_UPPER`: The provided `*A` pointer points to the upper triangle matrix data. + - `HIPSOLVER_FILL_MODE_LOWER`: The provided `*A` pointer points to the lower triangle matrix data. + - `int n`: Number of rows and columns of $A$. + - `double *A`: Pointer to matrix $A$ in device memory. + - `int lda`: Leading dimension of matrix $A$. + - `double *B`: Pointer to matrix $B$ in device memory. + - `int ldb`: Leading dimension of matrix $B$. + - `double *D`: Pointer to array $W$, where the resulting eigenvalues are written. + - `double *work`: Pointer to working space in device memory. + - `int lwork`: Size of the working space. + - `int *devInfo`: Convergence result of the function in device memory. If 0, the algorithm converged; if greater than 0 and less than or equal to `n`, the `devInfo`-th leading minor of `B` is not positive definite; if equal to `n+1`, convergence was not achieved. Also, for the CUDA backend, if `devInfo = -i` for $0 < i \leq n$, then the $i^{th}$ parameter is wrong (not counting the handle).
+ - `hipsolverSyevjInfo_t params`: Pointer to the structure with the solver parameters, which should be created by the function `hipsolverCreateSyevjInfo(&params)`. The solver has two parameters: + + - Tolerance `tol`, set by the function `hipsolverXsyevjSetTolerance(syevj_params, tol)`; the default value of the tolerance is machine zero. + - Maximal number of sweeps to obtain convergence `max_sweeps`, set by the function `hipsolverXsyevjSetMaxSweeps(syevj_params, max_sweeps)`; the default value is 100. + + Return type: `hipsolverStatus_t`. + - `hipsolver[SD]sygvj_bufferSize` allows obtaining the size (in bytes) needed for the working space for the `hipsolver[SD]sygvj` function. The character matched in `[SD]` coincides with the one in `hipsolver[SD]sygvj`. - This function accepts the following input parameters: - - `hipsolverHandle_t handle` - - `hipsolverEigType_t itype`: Specifies the type of eigensystem problem. - - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. - - `hipsolverFillMode_t uplo`: Specifies whether the upper or lower triangle of the symmetric matrix is stored. - - `int n`: Number of rows and columns of $A$. - - `double *A`: Pointer to matrix $A$ in device memory. - - `int lda`: Leading dimension of matrix $A$. - - `double *B`: Pointer to matrix $B$ in device memory. - - `int ldb`: Leading dimension of matrix $B$. - - `double *D`: Pointer to array $W$ in device memory, where the resulting eigenvalues are written. - - `int *lwork`: The required buffer size is written to this location. - - `hipsolverSyevjInfo_t params`: Pointer to the structure with the solver parameters. - - Return type: `hipsolverStatus_t`. + This function accepts the following input parameters: + - `hipsolverHandle_t handle` + - `hipsolverEigType_t itype`: Specifies the type of eigensystem problem. + - `hipsolverEigMode_t jobz`: Specifies whether the eigenvectors should also be calculated besides the eigenvalues. + - `hipsolverFillMode_t uplo`: Specifies whether the upper or lower triangle of the symmetric matrix is stored. + - `int n`: Number of rows and columns of $A$. + - `double *A`: Pointer to matrix $A$ in device memory. + - `int lda`: Leading dimension of matrix $A$. + - `double *B`: Pointer to matrix $B$ in device memory. + - `int ldb`: Leading dimension of matrix $B$. + - `double *D`: Pointer to array $W$ in device memory, where the resulting eigenvalues are written. + - `int *lwork`: The required buffer size is written to this location. + - `hipsolverSyevjInfo_t params`: Pointer to the structure with the solver parameters. + + Return type: `hipsolverStatus_t`. ## Used API surface + ### hipSOLVER + Types: + - `hipsolverHandle_t` - `hipsolverSyevjInfo_t` - `hipsolverEigType_t` @@ -92,6 +101,7 @@ Types: - `hipsolverFillMode_t` Functions: + - `hipsolverCreate` - `hipsolverDestroy` - `hipsolverCreateSyevjInfo` @@ -104,6 +114,7 @@ Functions: - `hipsolverDsygvj` ### HIP runtime + - `hipMalloc` - `hipMemcpy` - `hipFree` diff --git a/Libraries/rocBLAS/README.md b/Libraries/rocBLAS/README.md index ef829cca..47fc7acf 100644 --- a/Libraries/rocBLAS/README.md +++ b/Libraries/rocBLAS/README.md @@ -1,30 +1,36 @@ # rocBLAS Examples ## Summary + The examples in this subdirectory showcase the functionality of the [rocBLAS](https://github.com/ROCmSoftwarePlatform/rocBLAS) library. The examples build on both Linux and Windows for the ROCm (AMD GPU) backend.
## Prerequisites + ### Linux + - [CMake](https://cmake.org/download/) (at least version 3.21) - OR GNU Make - available via the distribution's package manager - [ROCm](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.2/page/Overview_of_ROCm_Installation_Methods.html) (at least version 5.x.x) - [rocBLAS](https://github.com/ROCmSoftwarePlatform/rocBLAS): `rocblas` package available from [repo.radeon.com](https://repo.radeon.com/rocm/). The repository is added during the standard ROCm [install procedure](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.2/page/How_to_Install_ROCm.html). - ### Windows + - [Visual Studio](https://visualstudio.microsoft.com/) 2019 or 2022 with the "Desktop Development with C++" workload - ROCm toolchain for Windows (No public release yet) - The Visual Studio ROCm extension needs to be installed to build with the solution files. - [rocBLAS](https://github.com/ROCmSoftwarePlatform/rocBLAS) - - Installed as part of the ROCm SDK on Windows for ROCm platform. + - Installed as part of the ROCm SDK on Windows for ROCm platform. - [CMake](https://cmake.org/download/) (optional, to build with CMake. Requires at least version 3.21) - [Ninja](https://ninja-build.org/) (optional, to build with CMake) ## Building + ### Linux + Make sure that the dependencies are installed, or use the [provided Dockerfiles](../../Dockerfiles/) to build and run the examples in a containerized environment that has all prerequisites installed. #### Using CMake + All examples in the `rocBLAS` subdirectory can either be built by a single CMake project or be built independently. - `$ cd Libraries/rocBLAS` @@ -32,16 +38,20 @@ All examples in the `rocBLAS` subdirectory can either be built by a single CMake - `$ cmake --build build` #### Using Make + All examples can be built by a single invocation to Make or be built independently. - `$ cd Libraries/rocBLAS` - `$ make` ### Windows + #### Visual Studio + Visual Studio solution files are available for the individual examples. To build all examples for rocBLAS open the top level solution file [ROCm-Examples-VS2019.sln](../../ROCm-Examples-VS2019.sln) and filter for rocBLAS. For more detailed build instructions refer to the top level [README.md](../../README.md#visual-studio). #### CMake + All examples in the `rocBLAS` subdirectory can either be built by a single CMake project or be built independently. For build instructions refer to the top-level [README.md](../../README.md#cmake-2). diff --git a/Libraries/rocBLAS/level_1/axpy/README.md b/Libraries/rocBLAS/level_1/axpy/README.md index bbf6f3f8..5f79413e 100644 --- a/Libraries/rocBLAS/level_1/axpy/README.md +++ b/Libraries/rocBLAS/level_1/axpy/README.md @@ -1,9 +1,11 @@ # rocBLAS Level 1 AXPY Example ## Description + This example showcases the usage of rocBLAS' Level 1 AXPY function. The Level 1 API defines operations between vector and vector. AXPY is the operation $y_i=ax_i+y_i$ for two vectors $x$ and $y$, and a scalar value $a$. -### Application flow +### Application flow + 1. Read in command-line parameters. 2. Allocate and initialize host vectors. 3. Compute CPU reference result. @@ -13,23 +15,29 @@ This example showcases the usage of rocBLAS' Level 1 AXPY function. The Level 1 7. Call rocBLAS' AXPY function. 8. Copy the result from device to host. 9. Destroy the rocBLAS handle, release device memory. -10. Validate the output by comparing it to the CPU reference result. +10. Validate the output by comparing it to the CPU reference result. 
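A condensed sketch of this flow is shown below. It is not taken from the example's sources: error handling is omitted, values are hard-coded instead of being parsed from the command line, and the rocBLAS header path may differ between ROCm releases.

```cpp
// Hypothetical, condensed version of the AXPY example flow above.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <vector>

int main()
{
    const rocblas_int  n     = 5;
    const float        alpha = 1.f;
    std::vector<float> x(n, 1.f), y(n, 2.f); // expected result: y_i = 3

    float *d_x, *d_y;
    hipMalloc(&d_x, n * sizeof(float));
    hipMalloc(&d_y, n * sizeof(float));
    hipMemcpy(d_x, x.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, y.data(), n * sizeof(float), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    // alpha resides in host memory, so use the host pointer mode.
    rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);

    // y := alpha * x + y with unit strides (incx = incy = 1).
    rocblas_saxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    hipMemcpy(y.data(), d_y, n * sizeof(float), hipMemcpyDeviceToHost);
    rocblas_destroy_handle(handle);
    hipFree(d_x);
    hipFree(d_y);
}
```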
### Command line interface + The application provides the following optional command line arguments: + - `-a` or `--alpha`. The scalar value $a$ used in the AXPY operation. Its default value is 1. - `-x` or `--incx`. The stride between consecutive values in the data array that makes up vector $x$, must be greater than zero. Its default value is 1. - `-y` or `--incy`. The stride between consecutive values in the data array that makes up vector $y$, must be greater than zero. Its default value is 1. - `-n` or `--n`. The number of elements in vectors $x$ and $y$, must be greater than zero. Its default value is 5. ## Key APIs and Concepts + - rocBLAS is initialized by calling `rocblas_create_handle(rocblas_handle*)` and it is terminated by calling `rocblas_destroy_handle(rocblas_handle)`. + - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`rocblas_pointer_mode_host`) or on the device (`rocblas_pointer_mode_device`). It is controlled by `rocblas_set_pointer_mode`. + - `rocblas_Xaxpy` computes the AXPY operation as defined above. `X` is one of `s` (single-precision: `rocblas_float`), `d` (double-precision: `rocblas_double`), `h` (half-precision: `rocblas_half`), `c` (single-precision complex: `rocblas_complex`), or `z` (double-precision complex: `rocblas_double_complex`). ## Demonstrated API Calls ### rocBLAS + - `rocblas_create_handle` - `rocblas_destroy_handle` - `rocblas_float` @@ -42,6 +50,7 @@ The application provides the following optional command line arguments: - `rocblas_status_to_string` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/rocBLAS/level_1/dot/README.md b/Libraries/rocBLAS/level_1/dot/README.md index 8db93906..867389d8 100644 --- a/Libraries/rocBLAS/level_1/dot/README.md +++ b/Libraries/rocBLAS/level_1/dot/README.md @@ -1,9 +1,11 @@ # rocBLAS Level 1 Dot Example ## Description + This example showcases the usage of rocBLAS' Level 1 DOT function. The Level 1 API defines operations between vector and vector. DOT is a dot product operator between $x$ and $y$ vectors defined as $\sum_i{x_i \cdot y_i}$. -### Application flow +### Application flow + 1. Read in and parse command line parameters. 2. Allocate and initialize host vectors. 3. Compute CPU reference result. @@ -13,22 +15,28 @@ This example showcases the usage of rocBLAS' Level 1 DOT function. The Level 1 A 7. Call `rocblas_sdot()` asynchronous rocBLAS dot product function. 8. Copy the result from device to host. 9. Destroy the rocBLAS handle, release device memory. -10. Validate the output by comparing it to the CPU reference result. +10. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-x` or `--incx`. The stride between consecutive values in the data array that makes up vector $x$, must be greater than zero. Its default value is 1. - `-y` or `--incy`. The stride between consecutive values in the data array that makes up vector $y$, must be greater than zero. Its default value is 1. - `-n`. The number of elements in vectors $x$ and $y$. Its default value is 5. ## Key APIs and Concepts + - rocBLAS is initialized by calling `rocblas_create_handle(rocblas_handle*)` and it is terminated by calling `rocblas_destroy_handle(rocblas_handle)`. + - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`rocblas_pointer_mode_host`) or on the device (`rocblas_pointer_mode_device`). 
It is controlled by `rocblas_set_pointer_mode`. + - `rocblas_Xdot` computes the dot product of two vectors as defined above. `X` is one of `s` (single-precision: `rocblas_float`), `d` (double-precision: `rocblas_double`), `h` (half-precision: `rocblas_half`), `c` (single-precision complex: `rocblas_complex`), or `z` (double-precision complex: `rocblas_double_complex`). ## Demonstrated API Calls ### rocBLAS + - `rocblas_create_handle` - `rocblas_destroy_handle` - `rocblas_handle` @@ -42,6 +50,7 @@ The application provides the following optional command line arguments: - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/rocBLAS/level_1/nrm2/README.md b/Libraries/rocBLAS/level_1/nrm2/README.md index 17a60217..671e075f 100644 --- a/Libraries/rocBLAS/level_1/nrm2/README.md +++ b/Libraries/rocBLAS/level_1/nrm2/README.md @@ -1,9 +1,11 @@ # rocBLAS Level 1 NRM2 Example ## Description + This example showcases the usage of rocBLAS' Level 1 NRM2 function. NRM2 is the Euclidean norm operator, applied to a vector $x$ and defined as $\sqrt{\sum_i{x_i^2}}$. -### Application flow +### Application flow + 1. Read in and parse command line parameters. 2. Allocate and initialize host vector. 3. Compute CPU reference result. 4. Allocate device memory. 5. Copy input vector from host to device memory. 6. Set up pointer mode (host memory). 7. Call `rocblas_snrm2()`, the asynchronous rocBLAS Euclidean norm function. 8. Copy the result from device to host. 9. Destroy the rocBLAS handle, release device memory. -10. Validate the output by comparing it to the CPU reference result. +10. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-x` or `--incx`. The stride between consecutive values in the data array that makes up vector $x$, must be greater than zero. Its default value is 1. - `-n`. The number of elements in vector $x$. Its default value is 5. ## Key APIs and Concepts + - rocBLAS is initialized by calling `rocblas_create_handle(rocblas_handle*)` and it is terminated by calling `rocblas_destroy_handle(rocblas_handle)`. + - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`rocblas_pointer_mode_host`) or on the device (`rocblas_pointer_mode_device`). It is controlled by `rocblas_set_pointer_mode`. + - `rocblas_[sdhcz]nrm2` computes the Euclidean norm of a vector as defined above. Depending on the character matched in `[sdhcz]`, the norm can be obtained with different precisions: - - `s` (single-precision: `rocblas_float`) - - `d` (double-precision: `rocblas_double`) - - `h` (half-precision: `rocblas_half`) - - `c` (single-precision complex: `rocblas_complex`) - - `z` (double-precision complex: `rocblas_double_complex`). + + - `s` (single-precision: `rocblas_float`) + - `d` (double-precision: `rocblas_double`) + - `h` (half-precision: `rocblas_half`) + - `c` (single-precision complex: `rocblas_complex`) + - `z` (double-precision complex: `rocblas_double_complex`).
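The snippet below sketches the `rocblas_snrm2` call described above. It is a minimal illustration rather than the example's actual source: error checking is omitted and the include path may vary across ROCm versions.

```cpp
// Minimal sketch: Euclidean norm of a device vector with rocblas_snrm2.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <vector>

int main()
{
    const rocblas_int  n = 5;
    std::vector<float> x(n, 2.f); // expected norm: sqrt(5 * 2^2)

    float* d_x;
    hipMalloc(&d_x, n * sizeof(float));
    hipMemcpy(d_x, x.data(), n * sizeof(float), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    // Host pointer mode: the scalar result is written to host memory.
    rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);

    float result = 0.f;
    rocblas_snrm2(handle, n, d_x, 1, &result);

    rocblas_destroy_handle(handle);
    hipFree(d_x);
}
```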
## Demonstrated API Calls ### rocBLAS + - `rocblas_create_handle` - `rocblas_destroy_handle` - `rocblas_handle` @@ -45,6 +54,7 @@ The application provides the following optional command line arguments: - `rocblas_status_to_string` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/rocBLAS/level_1/scal/README.md b/Libraries/rocBLAS/level_1/scal/README.md index e3119aa9..d7d4955e 100644 --- a/Libraries/rocBLAS/level_1/scal/README.md +++ b/Libraries/rocBLAS/level_1/scal/README.md @@ -1,9 +1,11 @@ # rocBLAS Level 1 Scal Example ## Description + This example showcases the usage of rocBLAS' Level 1 SCAL function. The Level 1 API defines operations between vector and vector. SCAL is a scaling operator for an $x$ vector defined as $x_i := \alpha \cdot x_i$. -### Application flow +### Application flow + 1. Read in and parse command line parameters. 2. Allocate and initialize host vector. 3. Compute CPU reference result. @@ -13,26 +15,33 @@ This example showcases the usage of rocBLAS' Level 1 SCAL function. The Level 1 7. Call rocBLAS' SCAL function. 8. Copy the result from device to host. 9. Destroy the rocBLAS handle, release device memory. -10. Validate the output by comparing it to the CPU reference result. +10. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-a` or `--alpha`. The scalar value $a$ used in the SCAL operation. Its default value is 3. - `-x` or `--incx`. The stride between consecutive values in the data array that makes up vector $x$, must be greater than zero. Its default value is 1. - `-n` or `--n`. The number of elements in vector $x$, must be greater than zero. Its default value is 5. ## Key APIs and Concepts + - rocBLAS is initialized by calling `rocblas_create_handle(rocblas_handle*)` and it is terminated by calling `rocblas_destroy_handle(rocblas_handle)`. + - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`rocblas_pointer_mode_host`) or on the device (`rocblas_pointer_mode_device`). It is controlled by `rocblas_set_pointer_mode`. + - `rocblas_[sdcz]scal` multiplies each element of the vector by a scalar. Depending on the character matched in `[sdcz]`, the scaling can be obtained with different precisions: - - `s` (single-precision: `rocblas_float`) - - `d` (double-precision: `rocblas_double`) - - `c` (single-precision complex: `rocblas_complex`) - - `z` (double-precision complex: `rocblas_double_complex`). + + - `s` (single-precision: `rocblas_float`) + - `d` (double-precision: `rocblas_double`) + - `c` (single-precision complex: `rocblas_complex`) + - `z` (double-precision complex: `rocblas_double_complex`). ## Demonstrated API Calls ### rocBLAS + - `rocblas_create_handle` - `rocblas_destroy_handle` - `rocblas_handle` @@ -43,6 +52,7 @@ The application provides the following optional command line arguments: - `rocblas_sscal` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/rocBLAS/level_1/swap/README.md b/Libraries/rocBLAS/level_1/swap/README.md index b858dc54..75d68120 100644 --- a/Libraries/rocBLAS/level_1/swap/README.md +++ b/Libraries/rocBLAS/level_1/swap/README.md @@ -1,9 +1,11 @@ # rocBLAS Level 1 Swap Example ## Description + This example shows the use of the rocBLAS Level 1 swap operation, which exchanges elements between two HIP vectors. The Level 1 API defines operations between vectors. ### Application flow + 1. Read in command-line parameters. 
2. Allocate and initialize host vectors. 3. Compute CPU reference result. @@ -15,23 +17,31 @@ This example shows the use of the rocBLAS Level 1 swap operation, which exchange 9. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-x` or `--incx`. The stride between consecutive values in the data array that makes up vector $x$, which must be greater than 0. Its default value is 1. + - `-y` or `--incy`. The stride between consecutive values in the data array that makes up vector $y$, which must be greater than 0. Its default value is 1. + - `-n` or `--n`. The number of elements in vectors $x$ and $y$, which must be greater than 0. Its default value is 5. ## Key APIs and Concepts + - rocBLAS is initialized by calling `rocblas_create_handle(rocblas_handle*)` and it is terminated by calling `rocblas_destroy_handle(rocblas_handle)`. + - `rocblas_set_vector(n, elem_size, *x, incx, *y, incy)` is used to copy vectors from host to device memory. `n` is the total number of elements that should be copied, and `elem_size` is the size of a single element in bytes. The elements are copied from `x` to `y`, where the step size between consecutive elements of `x` and `y` is given respectively by `incx` and `incy`. Note that the increment is given in elements, not bytes. Additionally, the step size of either `x`, `y`, or both may also be negative. In this case care must be taken that the correct pointer is passed to `rocblas_set_vector`, as it is not automatically adjusted to the end of the input vector. When `incx` and `incy` are 1, calling this function is equivalent to `hipMemcpy(y, x, n * elem_size, hipMemcpyHostToDevice)`. See the following diagram , which illustrates `rocblas_set_vector(3, sizeof(T), x, incx, y, incy)`: -![set_get_vector.svg](set_get_vector.svg) + ![set_get_vector.svg](set_get_vector.svg) - `rocblas_get_vector(n, elem_size, *x, incx, *y, incy)` is used to copy vectors from device to host memory. Its arguments are similar to `rocblas_set_vector`. Elements are also copied from `x` to `y`. + - `rocblas_Xswap(handle, n, *x, incx, *y, incy)` exchanges elements between vectors `x` and `y`. The two vectors are respectively indexed according to the step increments `incx` and `incy` which are each indexed according to step increments `incx` and `incy` similar to `rocblas_set_vector` and `rocblas_get_vector`. `n` gives the amount of elements that should be exchanged. `X` specifies the data type of the operation, and can be one of `s` (single-precision: `rocblas_float`), `d` (double-precision: `rocblas_double`), `h` (half-precision: `rocblas_half`), `c` (single-precision complex: `rocblas_complex`), or `z` (double-precision complex: `rocblas_double_complex`). ## Demonstrated API Calls ### rocBLAS + - `rocblas_create_handle` - `rocblas_destroy_handle` - `rocblas_get_vector` @@ -44,5 +54,6 @@ The application provides the following optional command line arguments: - `rocblas_status_to_string` ### HIP runtime + - `hipFree` - `hipMalloc` diff --git a/Libraries/rocBLAS/level_2/gemv/README.md b/Libraries/rocBLAS/level_2/gemv/README.md index 3a7b27b1..b7b4db93 100644 --- a/Libraries/rocBLAS/level_2/gemv/README.md +++ b/Libraries/rocBLAS/level_2/gemv/README.md @@ -1,9 +1,11 @@ # rocBLAS Level 2 General Matrix-Vector Product Example ## Description + This example illustrates the use of the rocBLAS Level 2 General Matrix-Vector Product functionality. 
This operation implements $y = \alpha \cdot A \cdot x + \beta \cdot y$, where $\alpha$ and $\beta$ are scalars, $A$ is an $m \times n$ matrix, $y$ is a vector of $m$ elements, and $x$ is a vector of $n$ elements. Additionally, this operation may optionally perform a (conjugate) transpose before the multiplication is performed. ### Application flow + 1. Read in command-line parameters. 2. Allocate and initialize the host vectors and matrix. 3. Compute CPU reference result. @@ -15,7 +17,9 @@ This example illustrates the use of the rocBLAS Level 2 General Matrix-Vector Pr 9. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-a` or `--alpha`. The scalar value $\alpha$ used in the GEMV operation. Its default value is 1. - `-b` or `--beta`. The scalar value $\beta$ used in the GEMV operation. Its default value is 1. - `-x` or `--incx`. The stride between consecutive values in the data array that makes up vector $x$, which must be greater than 0. Its default value is 1. - `-y` or `--incy`. The stride between consecutive values in the data array that makes up vector $y$, which must be greater than 0. Its default value is 1. @@ -24,17 +28,23 @@ The application provides the following optional command line arguments: - `-n` or `--n`. The number of columns in matrix $A$. - `-m` or `--m`. The number of rows in matrix $A$. ## Key APIs and Concepts + - rocBLAS is initialized by calling `rocblas_create_handle(rocblas_handle*)` and it is terminated by calling `rocblas_destroy_handle(rocblas_handle)`. + - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`rocblas_pointer_mode_host`) or on the device (`rocblas_pointer_mode_device`). It is controlled by `rocblas_set_pointer_mode`. + - `rocblas_Xgemv(handle, trans, m, n, *alpha, *A, lda, *x, incx, *beta, *y, incy)` computes a general matrix-vector product. `m` and `n` specify the dimensions of matrix $A$ _before_ any transpose operation is performed on it. `lda` is the _leading dimension_ of $A$: the number of elements between the starts of columns of $A$. Columns of $A$ are packed in memory. Note that rocBLAS matrices are stored in _column major_ ordering in memory. `x` and `y` specify vectors $x$ and $y$, and `incx` and `incy` denote the increment between consecutive items of the respective vectors in elements. `trans` specifies a matrix operation that may be performed before the matrix-vector product is computed: - - `rocblas_operation_none` specifies that no operation is performed. In this case, $x$ needs to have $n$ elements, and $y$ needs to have $m$ elements. - - `rocblas_operation_transpose` specifies that $A$ should be transposed ($A' = A^T$) before the matrix-vector product is performed. - - `rocblas_operation_conjugate_transpose` specifies that $A$ should be conjugate transposed ($A' = A^H$) before the matrix-vector product is performed. In this and the previous case, $x$ needs to have $m$ elements, and $y$ needs to have $n$ elements. + - `rocblas_operation_none` specifies that no operation is performed. In this case, $x$ needs to have $n$ elements, and $y$ needs to have $m$ elements. + - `rocblas_operation_transpose` specifies that $A$ should be transposed ($A' = A^T$) before the matrix-vector product is performed. + - `rocblas_operation_conjugate_transpose` specifies that $A$ should be conjugate transposed ($A' = A^H$) before the matrix-vector product is performed. In this and the previous case, $x$ needs to have $m$ elements, and $y$ needs to have $n$ elements. + `X` is a placeholder for the data type of the operation and can be either `s` (float: `rocblas_float`) or `d` (double: `rocblas_double`).
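A hedged sketch of the GEMV call described above follows. It is illustrative only, not part of the example's sources; it assumes column-major storage with `lda == m`, uses hard-coded sizes, and omits error handling.

```cpp
// Illustrative sketch: y := alpha * A * x + beta * y with rocblas_sgemv.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <vector>

int main()
{
    const rocblas_int  m = 3, n = 2;
    const float        alpha = 1.f, beta = 1.f;
    std::vector<float> A(m * n, 1.f), x(n, 1.f), y(m, 0.f);

    float *d_A, *d_x, *d_y;
    hipMalloc(&d_A, m * n * sizeof(float));
    hipMalloc(&d_x, n * sizeof(float));
    hipMalloc(&d_y, m * sizeof(float));
    hipMemcpy(d_A, A.data(), m * n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_x, x.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, y.data(), m * sizeof(float), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);

    // No transpose: A is m x n, lda = m (packed columns, column major).
    rocblas_sgemv(handle, rocblas_operation_none, m, n, &alpha, d_A, m,
                  d_x, 1, &beta, d_y, 1);

    hipMemcpy(y.data(), d_y, m * sizeof(float), hipMemcpyDeviceToHost);
    rocblas_destroy_handle(handle);
    hipFree(d_A);
    hipFree(d_x);
    hipFree(d_y);
}
```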
## Demonstrated API Calls ### rocBLAS + - `rocblas_create_handle` - `rocblas_destroy_handle` - `rocblas_float` @@ -51,6 +61,7 @@ The application provides the following optional command line arguments: - `rocblas_status_to_string` ### HIP runtime + - `hipFree` - `hipMalloc` - `hipMemcpy` diff --git a/Libraries/rocBLAS/level_2/her/README.md b/Libraries/rocBLAS/level_2/her/README.md index 37b6aa0c..e71398f5 100644 --- a/Libraries/rocBLAS/level_2/her/README.md +++ b/Libraries/rocBLAS/level_2/her/README.md @@ -1,9 +1,11 @@ # rocBLAS Level 2 Hermitian Rank-1 Update Example ## Description + This example showcases the usage of the rocBLAS Level2 Hermitian rank-1 update functionality. Additionally, this example demonstrates the compatible memory layout of three different complex float types (`hipFloatComplex`, `std::complex`, and `rocblas_float_complex`). Vectors of complex numbers can be passed to rocBLAS simply by performing a call to `hipMemcpy` and reinterpreting the respective pointers. ### Application flow + 1. Read in command-line parameters. 2. Allocate and initialize the host vector and matrix. 3. Compute CPU reference result. @@ -15,23 +17,31 @@ This example showcases the usage of the rocBLAS Level2 Hermitian rank-1 update f 9. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-a` or `--alpha`. The scalar value $\alpha$ used in the HER operation. Its default value is 1. - `-x` or `--incx`. The stride between consecutive values in the data array that makes up vector $x$, which must be greater than 0. Its default value is 1. - `-n` or `--n`. The number of elements in vectors $x$ and $y$, which must be greater than 0. Its default value is 5. ## Key APIs and Concepts + - rocBLAS is initialized by calling `rocblas_create_handle(rocblas_handle*)` and it is terminated by calling `rocblas_destroy_handle(rocblas_handle)`. + - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`rocblas_pointer_mode_host`) or on the device (`rocblas_pointer_mode_device`). It is controlled by `rocblas_set_pointer_mode`. + - `rocblas_[cz]her(handle, uplo, n, *alpha, *x, incx, *A, lda)` computes a Hermitian rank-1 update, defined as $A = A + \alpha \cdot x \cdot x ^ H$, where $A$ is an $n \times n$ Hermitian matrix, and $x$ is a complex vector of $n$ elements. The character matched in `[cz]` denotes the data type of the operation, and can either be `c` (complex float: `rocblas_complex_float`), or `z` (complex double: `rocblas_complex_double`). Because a Hermitian matrix is symmetric over the diagonal, except that the values in the upper triangle are the complex conjugate of the values in the lower triangle, the required work is reduced by only updating a single half of the matrix. The part of the matrix to update is given by `uplo`: `rocblas_fill_upper` indicates that the upper triangle of $A$ should be updated, and `rocblas_fill_lower` indicates that the lower triangle should be updated. Values in the other triangle are not altered. `n` gives the dimensions of $x$ and $A$, and `incx` the increment in elements between items of $x$. `lda` is the _leading dimension_ of $A$: the number of elements between the starts of columns of $A$. The elements of each column of $A$ are packed in memory. Note that rocBLAS matrices are laid out in _column major_ ordering. See the following figure, which illustrates the memory layout of a matrix with 3 rows and 2 columns:
-![matrix-layout.svg](matrix-layout.svg) + + ![matrix-layout.svg](matrix-layout.svg) - `hipFloatComplex`, `std::complex`, and `rocblas_float_complex` have compatible memory layout, and performing a memory copy between values of these types will correctly perform the expected copy. + - `hipCaddf(a, b)` adds `hipFloatComplex` values `a` and `b` element-wise together. This function is from a family of host/device HIP functions which operate on complex values. ## Demonstrated API Calls ### rocBLAS + - `rocblas_cher` - `rocblas_create_handle` - `rocblas_destroy_handle` @@ -49,6 +59,7 @@ The application provides the following optional command line arguments: - `rocblas_status_to_string` ### HIP runtime + - `hipCaddf` - `hipFloatComplex` - `hipFree` diff --git a/Libraries/rocBLAS/level_3/gemm/README.md b/Libraries/rocBLAS/level_3/gemm/README.md index 99e2bce0..6430fcb9 100644 --- a/Libraries/rocBLAS/level_3/gemm/README.md +++ b/Libraries/rocBLAS/level_3/gemm/README.md @@ -1,9 +1,11 @@ # rocBLAS Level 3 Generalized Matrix Multiplication Example ## Description + This example illustrates the use of the rocBLAS Level 3 General Matrix Multiplication. The rocBLAS GEMM performs a matrix--matrix operation as: $C = \alpha \cdot A' \cdot B' + \beta \cdot C$, where $X'$ is one of the following: + - $X' = X$ or - $X' = X^T$ (transpose $X$: $X_{ij}^T = X_{ji}$) or - $X' = X^H$ (Hermitian $X$: $X_{ij}^H = \bar{X_{ji}} $), @@ -13,6 +15,7 @@ $\alpha and $\beta$ are scalars, and $A$, $B$ and $C$ are matrices, with $A'$ an $m \times k$ matrix, $B'$ a $k \times n$ matrix and $C$ an $m \times n$ matrix. ### Application flow + 1. Read in command-line parameters. 2. Set dimension variables of the matrices. 3. Allocate and initialize the host matrices. Set up $B$ matrix as an identity matrix. @@ -27,7 +30,9 @@ $A'$ an $m \times k$ matrix, $B'$ a $k \times n$ matrix and $C$ an $m \times n$ 12. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-a` or `--alpha`. The scalar value $\alpha$ used in the GEMM operation. Its default value is 1. - `-b` or `--beta`. The scalar value $\beta$ used in the GEMM operation. Its default value is 1. - `-m` or `--m`. The number of rows of matrices $A$ and $C$, which must be greater than 0. Its default value is 5. @@ -35,37 +40,43 @@ The application provides the following optional command line arguments: - `-k` or `--k`. The number of columns of matrix $A$ and rows of matrix $B$, which must be greater than 0. Its default value is 5. ## Key APIs and Concepts + - rocBLAS is initialized by calling `rocblas_create_handle(rocblas_handle*)` and it is terminated by calling `rocblas_destroy_handle(rocblas_handle)`. + - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`rocblas_pointer_mode_host`) or on the device (`rocblas_pointer_mode_device`). It is controlled by `rocblas_set_pointer_mode`. + - `rocblas_[sdhcz]gemm` - Depending on the character matched in `[sdhcz]`, the norm can be obtained with different precisions: - - `s` (single-precision: `rocblas_float`) - - `d` (double-precision: `rocblas_double`) - - `h` (half-precision: `rocblas_half`) - - `c` (single-precision complex: `rocblas_complex`) - - `z` (double-precision complex: `rocblas_double_complex`). 
- - Input parameters: - - `rocblas_handle handle` - - `rocblas_operation transA`: transformation operator on $A$ matrix - - `rocblas_operation transB`: transformation operator on $B$ matrix - - `rocblas_int m`: number of rows in $A'$ and $C$ matrices - - `rocblas_int n`: number of columns in $B'$ and $C$ matrices - - `rocblas_int k`: number of columns in $A'$ matrix and number of rows in $B'$ matrix - - `const float *alpha`: scalar multiplier of $C$ matrix addition - - `const float *A`: pointer to the $A$ matrix - - `rocblas_int lda`: leading dimension of $A$ matrix - - `const float *B`: pointer to the $B$ matrix - - `rocblas_int ldb`: leading dimension of $B$ matrix - - `const float *beta`: scalar multiplier of the $B \cdot C$ matrix product - - `float *C`: pointer to the $C$ matrix - - `rocblas_int ldc`: leading dimension of $C$ matrix - - Return value: `rocblas_status` + + Depending on the character matched in `[sdhcz]`, the norm can be obtained with different precisions: + - `s` (single-precision: `rocblas_float`) + - `d` (double-precision: `rocblas_double`) + - `h` (half-precision: `rocblas_half`) + - `c` (single-precision complex: `rocblas_complex`) + - `z` (double-precision complex: `rocblas_double_complex`). + + Input parameters: + + - `rocblas_handle handle` + - `rocblas_operation transA`: transformation operator on $A$ matrix + - `rocblas_operation transB`: transformation operator on $B$ matrix + - `rocblas_int m`: number of rows in $A'$ and $C$ matrices + - `rocblas_int n`: number of columns in $B'$ and $C$ matrices + - `rocblas_int k`: number of columns in $A'$ matrix and number of rows in $B'$ matrix + - `const float *alpha`: scalar multiplier of $C$ matrix addition + - `const float *A`: pointer to the $A$ matrix + - `rocblas_int lda`: leading dimension of $A$ matrix + - `const float *B`: pointer to the $B$ matrix + - `rocblas_int ldb`: leading dimension of $B$ matrix + - `const float *beta`: scalar multiplier of the $B \cdot C$ matrix product + - `float *C`: pointer to the $C$ matrix + - `rocblas_int ldc`: leading dimension of $C$ matrix + + Return value: `rocblas_status` ## Demonstrated API Calls ### rocBLAS + - `rocblas_int` - `rocblas_float` - `rocblas_operation` @@ -78,6 +89,7 @@ The application provides the following optional command line arguments: - `rocblas_sgemm` ### HIP runtime + - `hipMalloc` - `hipFree` - `hipMemcpy` diff --git a/Libraries/rocBLAS/level_3/gemm_strided_batched/README.md b/Libraries/rocBLAS/level_3/gemm_strided_batched/README.md index f025c749..cb6027a6 100644 --- a/Libraries/rocBLAS/level_3/gemm_strided_batched/README.md +++ b/Libraries/rocBLAS/level_3/gemm_strided_batched/README.md @@ -1,21 +1,24 @@ # rocBLAS Level 3 Generalized Matrix Multiplication Strided Batched Example ## Description + This example illustrates the use of the rocBLAS Level 3 Strided Batched General Matrix Multiplication. The rocBLAS GEMM STRIDED BATCHED performs a matrix--matrix operation for a _batch_ of matrices as: $C[i] = \alpha \cdot A[i]' \cdot B[i]' + \beta \cdot (C[i])$ for each $i \in [0, batch - 1]$, where $X[i] = X + i \cdot strideX$ is the $i$-th element of the correspondent batch and $X'$ is one of the following: + - $X' = X$ or - $X' = X^T$ (transpose $X$: $X_{ij}^T = X_{ji}$) or - $X' = X^H$ (Hermitian $X$: $X_{ij}^H = \bar X_{ji} $). + In this example the identity is used. $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are the batches of matrices. 
diff --git a/Libraries/rocBLAS/level_3/gemm_strided_batched/README.md b/Libraries/rocBLAS/level_3/gemm_strided_batched/README.md index f025c749..cb6027a6 100644 --- a/Libraries/rocBLAS/level_3/gemm_strided_batched/README.md +++ b/Libraries/rocBLAS/level_3/gemm_strided_batched/README.md @@ -1,21 +1,24 @@ # rocBLAS Level 3 Generalized Matrix Multiplication Strided Batched Example ## Description + This example illustrates the use of the rocBLAS Level 3 Strided Batched General Matrix Multiplication. The rocBLAS GEMM STRIDED BATCHED performs a matrix--matrix operation for a _batch_ of matrices as: $C[i] = \alpha \cdot A[i]' \cdot B[i]' + \beta \cdot C[i]$ for each $i \in [0, batch - 1]$, where $X[i] = X + i \cdot strideX$ is the $i$-th element of the corresponding batch and $X'$ is one of the following: + - $X' = X$ or - $X' = X^T$ (transpose $X$: $X_{ij}^T = X_{ji}$) or - $X' = X^H$ (Hermitian $X$: $X_{ij}^H = \bar X_{ji} $). + In this example the identity transformation ($X' = X$) is used. $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are the batches of matrices. For each $i$, $A[i]$, $B[i]$ and $C[i]$ are matrices such that $A[i]'$ is an $m \times k$ matrix, $B[i]'$ a $k \times n$ matrix and $C[i]$ an $m \times n$ matrix. - ### Application flow + 1. Read in command-line parameters. 2. Set dimension variables of the matrices and get batch count and stride. 3. Allocate and initialize the host matrices. Set up $B$ matrix as an identity matrix. @@ -30,7 +33,9 @@ $A_i'$ is an $m \times k$ matrix, $B_i'$ a $k \times n$ matrix and $C_i$ an $m \ 12. Validate the output by comparing it to the CPU reference result. ### Command line interface + The application provides the following optional command line arguments: + - `-a` or `--alpha`. The scalar value $\alpha$ used in the GEMM operation. Its default value is 1. - `-b` or `--beta`. The scalar value $\beta$ used in the GEMM operation. Its default value is 1. - `-c` or `--count`. Batch count. Its default value is 3. @@ -39,46 +44,56 @@ The application provides the following optional command line arguments: - `-k` or `--k`. The number of columns of matrix $A_i$ and rows of matrix $B_i$. ## Key APIs and Concepts + - The performance of a numerical multi-linear algebra code can be significantly increased by using tensor contractions [ [Y. Shi et al., HiPC, pp 193, 2016.](https://doi.org/10.1109/HiPC.2016.031) ], hence most of the rocBLAS functions have `_batched` and `_strided_batched` [ [C. Jhurani and P. Mullowney, JPDP Vol 75, pp 133, 2015.](https://doi.org/10.1016/j.jpdc.2014.09.003) ] extensions.
-We can apply the same multiplication operator for several matrices if we combine them into batched matrices. Batched matrix multiplication has a performance improvement for a large number of small matrices. For a constant stride between matrices, further acceleration is available by strided batched GEMM.
+ + We can apply the same multiplication operator for several matrices if we combine them into batched matrices. Batched matrix multiplication yields a performance improvement for a large number of small matrices. For a constant stride between matrices, further acceleration is available with strided batched GEMM. + ![strided-matrix-layout.svg](strided-matrix-layout.svg) + - rocBLAS is initialized by calling `rocblas_create_handle(rocblas_handle*)` and it is terminated by calling `rocblas_destroy_handle(rocblas_handle)`. + - The _pointer mode_ controls whether scalar parameters must be allocated on the host (`rocblas_pointer_mode_host`) or on the device (`rocblas_pointer_mode_device`). It is controlled by `rocblas_set_pointer_mode`. + - `rocblas_stride`: the stride between the starts of consecutive matrices or vectors in `_strided_batched` functions. + - `rocblas_[sdhcz]gemm_strided_batched` - Depending on the character matched in `[sdhcz]`, the norm can be obtained with different precisions: - - `s` (single-precision: `rocblas_float`) - - `d` (double-precision: `rocblas_double`) - - `h` (half-precision: `rocblas_half`) - - `c` (single-precision complex: `rocblas_complex`) - - `z` (double-precision complex: `rocblas_double_complex`). - - Input parameters: - - `rocblas_handle handle` - - `rocblas_operation transA`: transformation operator on $A_i$ matrix - - `rocblas_operation transB`: transformation operator on $B_i$ matrix - - `rocblas_int m`: number of rows in $A_i'$ and $C_i$ matrices - - `rocblas_int n`: number of columns in $B_i'$ and $C_i$ matrices - - `rocblas_int k`: number of columns in $A_i'$ matrix and number of rows in $B_i'$ matrix - - `const float *alpha`: scalar multiplier of $C_i$ matrix addition - - `const float *A`: pointer to each $A_i$ matrix - - `rocblas_int lda`: leading dimension of each $A_i$ matrix - - `rocblas_stride stride_a`: stride size for each $A_i$ matrix - - `const float *B`: pointer to each $B_i$ matrix - - `rocblas_int ldb`: leading dimension of each $B_i$ matrix - - `const float *beta`: scalar multiplier of the $B \cdot C$ matrix product - - `rocblas_stride stride_b`: stride size for each $B_i$ matrix - - `float *C`: pointer to each $C_i$ matrix - - `rocblas_int ldc`: leading dimension of each $C_i$ matrix - - `rocblas_stride stride_c`: stride size for each $C_i$ matrix - - `rocblas_int batch_count`: number of matrices - - Return value: `rocblas_status` + Depending on the character matched in `[sdhcz]`, the operation can be executed with different precisions: + + - `s` (single-precision: `rocblas_float`) + - `d` (double-precision: `rocblas_double`) + - `h` (half-precision: `rocblas_half`) + - `c` (single-precision complex: `rocblas_float_complex`) + - `z` (double-precision complex: `rocblas_double_complex`).
+ + Input parameters: + + - `rocblas_handle handle` + - `rocblas_operation transA`: transformation operator on $A_i$ matrix + - `rocblas_operation transB`: transformation operator on $B_i$ matrix + - `rocblas_int m`: number of rows in $A_i'$ and $C_i$ matrices + - `rocblas_int n`: number of columns in $B_i'$ and $C_i$ matrices + - `rocblas_int k`: number of columns in $A_i'$ matrix and number of rows in $B_i'$ matrix + - `const float *alpha`: scalar multiplier of the $A_i' \cdot B_i'$ matrix product + - `const float *A`: pointer to each $A_i$ matrix + - `rocblas_int lda`: leading dimension of each $A_i$ matrix + - `rocblas_stride stride_a`: stride from the start of one $A_i$ matrix to the next + - `const float *B`: pointer to each $B_i$ matrix + - `rocblas_int ldb`: leading dimension of each $B_i$ matrix + - `rocblas_stride stride_b`: stride from the start of one $B_i$ matrix to the next + - `const float *beta`: scalar multiplier of each $C_i$ matrix + - `float *C`: pointer to each $C_i$ matrix + - `rocblas_int ldc`: leading dimension of each $C_i$ matrix + - `rocblas_stride stride_c`: stride from the start of one $C_i$ matrix to the next + - `rocblas_int batch_count`: number of matrices in the batch + + Return value: `rocblas_status` (a minimal call sketch is shown below, after the lists of demonstrated API calls) ## Demonstrated API Calls ### rocBLAS + - `rocblas_int` - `rocblas_float` - `rocblas_operation` @@ -92,6 +107,7 @@ We can apply the same multiplication operator for several matrices if we combine - `rocblas_sgemm_strided_batched` ### HIP runtime + - `hipMalloc` - `hipFree` - `hipMemcpy`
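The parameter list above maps onto a single call as in the following sketch (again not the example's actual source; sizes and data are placeholders, error checking is omitted, and the include path may be plain `rocblas.h` on older ROCm releases):

```cpp
// Sketch: C[i] = alpha * A[i] * B[i] + beta * C[i] for i in [0, batch_count),
// with batch_count column-major matrices stored back to back in device memory.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <vector>

int main()
{
    const rocblas_int    m = 5, n = 5, k = 5, batch_count = 3;
    const rocblas_stride stride_a = m * k, stride_b = k * n, stride_c = m * n;
    const float          alpha = 1.0f, beta = 1.0f;

    // Host batches with placeholder data, one matrix after another.
    std::vector<float> h_a(stride_a * batch_count, 1.0f);
    std::vector<float> h_b(stride_b * batch_count, 1.0f);
    std::vector<float> h_c(stride_c * batch_count, 0.0f);

    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, sizeof(float) * h_a.size());
    hipMalloc(&d_b, sizeof(float) * h_b.size());
    hipMalloc(&d_c, sizeof(float) * h_c.size());
    hipMemcpy(d_a, h_a.data(), sizeof(float) * h_a.size(), hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b.data(), sizeof(float) * h_b.size(), hipMemcpyHostToDevice);
    hipMemcpy(d_c, h_c.data(), sizeof(float) * h_c.size(), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);

    // One call multiplies the whole batch.
    rocblas_sgemm_strided_batched(handle,
                                  rocblas_operation_none, rocblas_operation_none,
                                  m, n, k,
                                  &alpha,
                                  d_a, m, stride_a,
                                  d_b, k, stride_b,
                                  &beta,
                                  d_c, m, stride_c,
                                  batch_count);

    hipMemcpy(h_c.data(), d_c, sizeof(float) * h_c.size(), hipMemcpyDeviceToHost);

    rocblas_destroy_handle(handle);
    hipFree(d_a);
    hipFree(d_b);
    hipFree(d_c);
}
```

Here `stride_a`, `stride_b` and `stride_c` equal the element count of one matrix, so the batch is densely packed; larger strides (for example, to skip padding between matrices) are equally valid.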
diff --git a/README.md b/README.md index 1e3f67b0..f68687dd 100644 --- a/README.md +++ b/README.md @@ -3,135 +3,139 @@ A collection of examples to enable new users to start using ROCm. Advanced users may learn about new functionality through our advanced examples. ## Repository Contents + - [AI](https://github.com/ROCm/rocm-examples/tree/develop/AI/MIGraphX/Quantization) Showcases the functionality for executing quantized models using Torch-MIGraphX. - [Applications](https://github.com/ROCm/rocm-examples/tree/develop/Applications/) groups a number of examples ... . - - [bitonic_sort](https://github.com/ROCm/rocm-examples/tree/develop/Applications/bitonic_sort/): Showcases how to order an array of $n$ elements using a GPU implementation of the bitonic sort. - - [convolution](https://github.com/ROCm/rocm-examples/tree/develop/Applications/convolution/): A simple GPU implementation for the calculation of discrete convolutions. - - [floyd_warshall](https://github.com/ROCm/rocm-examples/tree/develop/Applications/floyd_warshall/): Showcases a GPU implementation of the Floyd-Warshall algorithm for finding shortest paths in certain types of graphs. - - [histogram](https://github.com/ROCm/rocm-examples/tree/develop/Applications/histogram/): Histogram over a byte array with memory bank optimization. - - [monte_carlo_pi](https://github.com/ROCm/rocm-examples/tree/develop/Applications/monte_carlo_pi/): Monte Carlo estimation of $\pi$ using hipRAND for random number generation and hipCUB for evaluation. - - [prefix_sum](https://github.com/ROCm/rocm-examples/tree/develop/Applications/prefix_sum/): Showcases a GPU implementation of a prefix sum with a 2-kernel scan algorithm. + - [bitonic_sort](https://github.com/ROCm/rocm-examples/tree/develop/Applications/bitonic_sort/): Showcases how to order an array of $n$ elements using a GPU implementation of the bitonic sort. + - [convolution](https://github.com/ROCm/rocm-examples/tree/develop/Applications/convolution/): A simple GPU implementation for the calculation of discrete convolutions. + - [floyd_warshall](https://github.com/ROCm/rocm-examples/tree/develop/Applications/floyd_warshall/): Showcases a GPU implementation of the Floyd-Warshall algorithm for finding shortest paths in certain types of graphs. + - [histogram](https://github.com/ROCm/rocm-examples/tree/develop/Applications/histogram/): Histogram over a byte array with memory bank optimization. + - [monte_carlo_pi](https://github.com/ROCm/rocm-examples/tree/develop/Applications/monte_carlo_pi/): Monte Carlo estimation of $\pi$ using hipRAND for random number generation and hipCUB for evaluation. + - [prefix_sum](https://github.com/ROCm/rocm-examples/tree/develop/Applications/prefix_sum/): Showcases a GPU implementation of a prefix sum with a 2-kernel scan algorithm. - [Common](https://github.com/ROCm/rocm-examples/tree/develop/Common/) contains common utility functionality shared between the examples. - [HIP-Basic](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/) hosts self-contained recipes showcasing HIP runtime functionality. - - [assembly_to_executable](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/assembly_to_executable): Program and accompanying build systems that show how to manually compile and link a HIP application from host and device code. - - [bandwidth](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/bandwidth): Program that measures memory bandwidth from host to device, device to host, and device to device. - - [bit_extract](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/bit_extract): Program that showcases how to use HIP built-in bit extract. - - [device_globals](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/device_globals): Show cases how to set global variables on the device from the host. - - [device_query](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/device_query): Program that showcases how properties from the device may be queried. - - [dynamic_shared](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/dynamic_shared): Program that showcases how to use dynamic shared memory with the help of a simple matrix transpose kernel. - - [events](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/events/): Measuring execution time and synchronizing with HIP events. - - [gpu_arch](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/gpu_arch/): Program that showcases how to implement GPU architecture-specific code. - - [hello_world](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/hello_world): Simple program that showcases launching kernels and printing from the device. - - [hipify](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/hipify): Simple program and build definitions that showcase automatically converting a CUDA `.cu` source into portable HIP `.hip` source. - - [llvm_ir_to_executable](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/llvm_ir_to_executable): Shows how to create a HIP executable from LLVM IR. - - [inline_assembly](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/inline_assembly/): Program that showcases how to use inline assembly in a portable manner. - - [matrix_multiplication](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/matrix_multiplication/): Multiply two dynamically sized matrices utilizing shared memory. - - [module_api](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/module_api/): Shows how to load and execute a HIP module in runtime.
- - [moving_average](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/moving_average/): Simple program that demonstrates parallel computation of a moving average of one-dimensional data. - - [multi_gpu_data_transfer](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/multi_gpu_data_transfer/): Performs two matrix transposes on two different devices (one on each) to showcase how to use peer-to-peer communication among devices. - - [occupancy](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/occupancy/): Shows how to find optimal configuration parameters for a kernel launch with maximum occupancy. - - [opengl_interop](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/opengl_interop): Showcases how to share resources and computation between HIP and OpenGL. - - [runtime_compilation](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/runtime_compilation/): Simple program that showcases how to use HIP runtime compilation (hipRTC) to compile a kernel and launch it on a device. - - [saxpy](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/saxpy/): Implements the $y_i=ax_i+y_i$ kernel and explains basic HIP functionality. - - [shared_memory](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/shared_memory/): Showcases how to use static shared memory by implementing a simple matrix transpose kernel. - - [static_device_library](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_device_library): Shows how to create a static library containing device functions, and how to link it with an executable. - - [static_host_library](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_host_library): Shows how to create a static library containing HIP host functions, and how to link it with an executable. - - [streams](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/streams/): Program that showcases usage of multiple streams each with their own tasks. - - [texture_management](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/texture_management/): Shows the usage of texture memory. - - [vulkan_interop](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/vulkan_interop): Showcases how to share resources and computation between HIP and Vulkan. - - [warp_shuffle](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/warp_shuffle/): Uses a simple matrix transpose kernel to showcase how to use warp shuffle operations. + - [assembly_to_executable](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/assembly_to_executable): Program and accompanying build systems that show how to manually compile and link a HIP application from host and device code. + - [bandwidth](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/bandwidth): Program that measures memory bandwidth from host to device, device to host, and device to device. + - [bit_extract](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/bit_extract): Program that showcases how to use HIP built-in bit extract. + - [device_globals](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/device_globals): Showcases how to set global variables on the device from the host. + - [device_query](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/device_query): Program that showcases how properties from the device may be queried.
+ - [dynamic_shared](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/dynamic_shared): Program that showcases how to use dynamic shared memory with the help of a simple matrix transpose kernel. + - [events](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/events/): Measuring execution time and synchronizing with HIP events. + - [gpu_arch](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/gpu_arch/): Program that showcases how to implement GPU architecture-specific code. + - [hello_world](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/hello_world): Simple program that showcases launching kernels and printing from the device. + - [hipify](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/hipify): Simple program and build definitions that showcase automatically converting a CUDA `.cu` source into portable HIP `.hip` source. + - [llvm_ir_to_executable](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/llvm_ir_to_executable): Shows how to create a HIP executable from LLVM IR. + - [inline_assembly](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/inline_assembly/): Program that showcases how to use inline assembly in a portable manner. + - [matrix_multiplication](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/matrix_multiplication/): Multiply two dynamically sized matrices utilizing shared memory. + - [module_api](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/module_api/): Shows how to load and execute a HIP module at runtime. + - [moving_average](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/moving_average/): Simple program that demonstrates parallel computation of a moving average of one-dimensional data. + - [multi_gpu_data_transfer](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/multi_gpu_data_transfer/): Performs two matrix transposes on two different devices (one on each) to showcase how to use peer-to-peer communication among devices. + - [occupancy](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/occupancy/): Shows how to find optimal configuration parameters for a kernel launch with maximum occupancy. + - [opengl_interop](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/opengl_interop): Showcases how to share resources and computation between HIP and OpenGL. + - [runtime_compilation](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/runtime_compilation/): Simple program that showcases how to use HIP runtime compilation (hipRTC) to compile a kernel and launch it on a device. + - [saxpy](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/saxpy/): Implements the $y_i=ax_i+y_i$ kernel and explains basic HIP functionality. + - [shared_memory](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/shared_memory/): Showcases how to use static shared memory by implementing a simple matrix transpose kernel. + - [static_device_library](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_device_library): Shows how to create a static library containing device functions, and how to link it with an executable. + - [static_host_library](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/static_host_library): Shows how to create a static library containing HIP host functions, and how to link it with an executable. + - [streams](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/streams/): Program that showcases usage of multiple streams each with their own tasks.
+ - [texture_management](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/texture_management/): Shows the usage of texture memory. + - [vulkan_interop](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/vulkan_interop): Showcases how to share resources and computation between HIP and Vulkan. + - [warp_shuffle](https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/warp_shuffle/): Uses a simple matrix transpose kernel to showcase how to use warp shuffle operations. - [Dockerfiles](https://github.com/ROCm/rocm-examples/tree/develop/Dockerfiles/) hosts Dockerfiles with ready-to-use environments for the various samples. See [Dockerfiles/README.md](https://github.com/ROCm/rocm-examples/tree/develop/Dockerfiles/README.md) for details. - [Docs](https://github.com/ROCm/rocm-examples/tree/develop/Docs/) - - [CONTRIBUTING.md](https://github.com/ROCm/rocm-examples/tree/develop/Docs/CONTRIBUTING.md) contains information on how to contribute to the examples. + - [CONTRIBUTING.md](https://github.com/ROCm/rocm-examples/tree/develop/Docs/CONTRIBUTING.md) contains information on how to contribute to the examples. - [Libraries](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/) - - [hipBLAS](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/) - - [gemm_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/gemm_strided_batched/): Showcases the general matrix product operation with strided and batched matrices. - - [her](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/her/): Showcases a rank-2 update of a Hermitian matrix with complex values. - - [scal](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/scal/): Simple program that showcases vector scaling (SCAL) operation. - - [hipCUB](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/) - - [device_radix_sort](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/device_radix_sort/): Simple program that showcases `hipcub::DeviceRadixSort::SortPairs`. - - [device_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/device_sum/): Simple program that showcases `hipcub::DeviceReduce::Sum`. - - [hipSOLVER](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/) - - [gels](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/gels/): Solve a linear system of the form $A\times X=B$. - - [geqrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/geqrf/): Program that showcases how to obtain a QR decomposition with the hipSOLVER API. - - [gesvd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/gesvd/): Program that showcases how to obtain a singular value decomposition with the hipSOLVER API. - - [getrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/getrf): Program that showcases how to perform a LU factorization with hipSOLVER. - - [potrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/potrf/): Perform Cholesky factorization and solve linear system with result. - - [syevd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevd/): Program that showcases how to calculate the eigenvalues of a matrix using a divide-and-conquer algorithm in hipSOLVER. 
- - [syevdx](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevdx/): Shows how to compute a subset of the eigenvalues and the corresponding eigenvectors of a real symmetric matrix A using the Compatibility API of hipSOLVER. - - [sygvd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/sygvd/): Showcases how to obtain a solution $(X, \Lambda)$ for a generalized symmetric-definite eigenvalue problem of the form $A \cdot X = B\cdot X \cdot \Lambda$. - - [syevj](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevj): Calculates the eigenvalues and eigenvectors from a real symmetric matrix using the Jacobi method. - - [syevj_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevj_batched): Showcases how to compute the eigenvalues and eigenvectors (via Jacobi method) of each matrix in a batch of real symmetric matrices. - - [sygvj](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/sygvj): Calculates the generalized eigenvalues and eigenvectors from a pair of real symmetric matrices using the Jacobi method. - - [rocBLAS](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/) - - [level_1](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/): Operations between vectors and vectors. - - [axpy](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/axpy/): Simple program that showcases the AXPY operation. - - [dot](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/dot/): Simple program that showcases dot product. - - [nrm2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/nrm2/): Simple program that showcases Euclidean norm of a vector. - - [scal](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/scal/): Simple program that showcases vector scaling (SCAL) operation. - - [swap](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/swap/): Showcases exchanging elements between two vectors. - - [level_2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/): Operations between vectors and matrices. - - [her](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/her/): Showcases a rank-1 update of a Hermitian matrix with complex values. - - [gemv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/gemv/): Showcases the general matrix-vector product operation. - - [level_3](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/): Operations between matrices and matrices. - - [gemm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/gemm/): Showcases the general matrix product operation. - - [gemm_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/gemm_strided_batched/): Showcases the general matrix product operation with strided and batched matrices. - - [rocPRIM](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/) - - [block_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/block_sum/): Simple program that showcases `rocprim::block_reduce` with an addition operator. - - [device_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/device_sum/): Simple program that showcases `rocprim::reduce` with an addition operator. 
- - [rocRAND](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocRAND/) - - [simple_distributions_cpp](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocRAND/simple_distributions_cpp/): A command-line app to compare random number generation on the CPU and on the GPU with rocRAND. - - [rocSOLVER](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/) - - [getf2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/getf2): Program that showcases how to perform a LU factorization with rocSOLVER. - - [getri](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/getri): Program that showcases matrix inversion by LU-decomposition using rocSOLVER. - - [syev](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev): Shows how to compute the eigenvalues and eigenvectors from a symmetrical real matrix. - - [syev_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev_batched): Shows how to compute the eigenvalues and eigenvectors for each matrix in a batch of real symmetric matrices. - - [syev_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev_strided_batched): Shows how to compute the eigenvalues and eigenvectors for multiple symmetrical real matrices, that are stored with an arbitrary stride. - - [rocSPARSE](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/) - - [level_2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/): Operations between sparse matrices and dense vectors. - - [bsrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrmv/): Showcases a sparse matrix-vector multiplication using BSR storage format. - - [bsrxmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrxmv/): Showcases a masked sparse matrix-vector multiplication using BSR storage format. - - [bsrsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix. - - [coomv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/coomv/): Showcases a sparse matrix-vector multiplication using COO storage format. - - [csrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/csrmv/): Showcases a sparse matrix-vector multiplication using CSR storage format. - - [csrsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/csrsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix. - - [ellmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/ellmv/): Showcases a sparse matrix-vector multiplication using ELL storage format. - - [gebsrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/gebsrmv/): Showcases a sparse matrix-dense vector multiplication using GEBSR storage format. - - [gemvi](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/gemvi/): Showcases a dense matrix-sparse vector multiplication. - - [spmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/spmv/): Showcases a general sparse matrix-dense vector multiplication. 
- - [spsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/spsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a sparse triangular matrix. - - [level_3](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/): Operations between sparse and dense matrices. - - [bsrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/bsrmm/): Showcases a sparse matrix-matrix multiplication using BSR storage format. - - [bsrsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/bsrsm): Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix, with solution and right-hand side stored in dense matrices. - - [csrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/csrmm/): Showcases a sparse matrix-matrix multiplication using CSR storage format. - - [csrsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/csrsm): Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix, with solution and right-hand side stored in dense matrices. - - [gebsrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/gebsrmm/): Showcases a sparse matrix-matrix multiplication using GEBSR storage format. - - [gemmi](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/gemmi/): Showcases a dense matrix sparse matrix multiplication using CSR storage format. - - [sddmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/sddmm/): Showcases a sampled dense-dense matrix multiplication using CSR storage format. - - [spmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/spmm/): Showcases a sparse matrix-dense matrix multiplication. - - [spsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/spsm/): Showcases a sparse triangular linear system solver using CSR storage format. - - [preconditioner](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/): Manipulations on sparse matrices to obtain sparse preconditioner matrices. - - [bsric0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/bsric0/): Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse BSR matrix. - - [bsrilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/bsrilu0/): Showcases how to obtain the incomplete LU decomposition of a sparse BSR square matrix. - - [csric0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csric0/): Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse CSR matrix. - - [csrilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csrilu0/): Showcases how to obtain the incomplete LU decomposition of a sparse CSR square matrix. - - [csritilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csritilu0/): Showcases how to obtain iteratively the incomplete LU decomposition of a sparse CSR square matrix. 
- - [rocThrust](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/) - - [device_ptr](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/device_ptr/): Simple program that showcases the usage of the `thrust::device_ptr` template. - - [norm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/norm/): An example that computes the Euclidean norm of a `thrust::device_vector`. - - [reduce_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/reduce_sum/): An example that computes the sum of a `thrust::device_vector` integer vector using the `thrust::reduce()` generalized summation and the `thrust::plus` operator. - - [remove_points](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/remove_points/): Simple program that demonstrates the usage of the `thrust` random number generation, host vector, generation, tuple, zip iterator, and conditional removal templates. It generates a number of random points in a unit square and then removes all of them outside the unit circle. - - [saxpy](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/saxpy/): Simple program that implements the SAXPY operation (`y[i] = a * x[i] + y[i]`) using rocThrust and showcases the usage of the vector and functor templates and of `thrust::fill` and `thrust::transform` operations. - - [vectors](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/vectors/): Simple program that showcases the `host_vector` and the `device_vector` of rocThrust. + - [hipBLAS](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/) + - [gemm_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/gemm_strided_batched/): Showcases the general matrix product operation with strided and batched matrices. + - [her](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/her/): Showcases a rank-2 update of a Hermitian matrix with complex values. + - [scal](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipBLAS/scal/): Simple program that showcases the vector scaling (SCAL) operation. + - [hipCUB](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/) + - [device_radix_sort](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/device_radix_sort/): Simple program that showcases `hipcub::DeviceRadixSort::SortPairs`. + - [device_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipCUB/device_sum/): Simple program that showcases `hipcub::DeviceReduce::Sum`. + - [hipSOLVER](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/) + - [gels](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/gels/): Solve a linear system of the form $A\times X=B$. + - [geqrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/geqrf/): Program that showcases how to obtain a QR decomposition with the hipSOLVER API. + - [gesvd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/gesvd/): Program that showcases how to obtain a singular value decomposition with the hipSOLVER API. + - [getrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/getrf): Program that showcases how to perform an LU factorization with hipSOLVER. + - [potrf](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/potrf/): Performs a Cholesky factorization and solves a linear system with the result.
+ - [syevd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevd/): Program that showcases how to calculate the eigenvalues of a matrix using a divide-and-conquer algorithm in hipSOLVER. + - [syevdx](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevdx/): Shows how to compute a subset of the eigenvalues and the corresponding eigenvectors of a real symmetric matrix A using the Compatibility API of hipSOLVER. + - [sygvd](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/sygvd/): Showcases how to obtain a solution $(X, \Lambda)$ for a generalized symmetric-definite eigenvalue problem of the form $A \cdot X = B\cdot X \cdot \Lambda$. + - [syevj](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevj): Calculates the eigenvalues and eigenvectors of a real symmetric matrix using the Jacobi method. + - [syevj_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/syevj_batched): Showcases how to compute the eigenvalues and eigenvectors (via Jacobi method) of each matrix in a batch of real symmetric matrices. + - [sygvj](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/hipSOLVER/sygvj): Calculates the generalized eigenvalues and eigenvectors of a pair of real symmetric matrices using the Jacobi method. + - [rocBLAS](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/) + - [level_1](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/): Operations between vectors and vectors. + - [axpy](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/axpy/): Simple program that showcases the AXPY operation. + - [dot](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/dot/): Simple program that showcases the dot product. + - [nrm2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/nrm2/): Simple program that showcases the Euclidean norm of a vector. + - [scal](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/scal/): Simple program that showcases the vector scaling (SCAL) operation. + - [swap](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_1/swap/): Showcases exchanging elements between two vectors. + - [level_2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/): Operations between vectors and matrices. + - [her](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/her/): Showcases a rank-1 update of a Hermitian matrix with complex values. + - [gemv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_2/gemv/): Showcases the general matrix-vector product operation. + - [level_3](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/): Operations between matrices and matrices. + - [gemm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/gemm/): Showcases the general matrix product operation. + - [gemm_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocBLAS/level_3/gemm_strided_batched/): Showcases the general matrix product operation with strided and batched matrices. + - [rocPRIM](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/) + - [block_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/block_sum/): Simple program that showcases `rocprim::block_reduce` with an addition operator.
+ - [device_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocPRIM/device_sum/): Simple program that showcases `rocprim::reduce` with an addition operator. + - [rocRAND](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocRAND/) + - [simple_distributions_cpp](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocRAND/simple_distributions_cpp/): A command-line app to compare random number generation on the CPU and on the GPU with rocRAND. + - [rocSOLVER](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/) + - [getf2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/getf2): Program that showcases how to perform an LU factorization with rocSOLVER. + - [getri](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/getri): Program that showcases matrix inversion by LU decomposition using rocSOLVER. + - [syev](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev): Shows how to compute the eigenvalues and eigenvectors of a real symmetric matrix. + - [syev_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev_batched): Shows how to compute the eigenvalues and eigenvectors for each matrix in a batch of real symmetric matrices. + - [syev_strided_batched](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSOLVER/syev_strided_batched): Shows how to compute the eigenvalues and eigenvectors for multiple real symmetric matrices that are stored with an arbitrary stride. + - [rocSPARSE](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/) + - [level_2](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/): Operations between sparse matrices and dense vectors. + - [bsrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrmv/): Showcases a sparse matrix-vector multiplication using BSR storage format. + - [bsrxmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrxmv/): Showcases a masked sparse matrix-vector multiplication using BSR storage format. + - [bsrsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/bsrsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix. + - [coomv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/coomv/): Showcases a sparse matrix-vector multiplication using COO storage format. + - [csrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/csrmv/): Showcases a sparse matrix-vector multiplication using CSR storage format. + - [csrsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/csrsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix. + - [ellmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/ellmv/): Showcases a sparse matrix-vector multiplication using ELL storage format. + - [gebsrmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/gebsrmv/): Showcases a sparse matrix-dense vector multiplication using GEBSR storage format. + - [gemvi](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/gemvi/): Showcases a dense matrix-sparse vector multiplication.
+ - [spmv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/spmv/): Showcases a general sparse matrix-dense vector multiplication. + - [spsv](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_2/spsv/): Showcases how to solve a linear system of equations whose coefficients are stored in a sparse triangular matrix. + - [level_3](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/): Operations between sparse and dense matrices. + - [bsrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/bsrmm/): Showcases a sparse matrix-matrix multiplication using BSR storage format. + - [bsrsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/bsrsm): Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix, with solution and right-hand side stored in dense matrices. + - [csrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/csrmm/): Showcases a sparse matrix-matrix multiplication using CSR storage format. + - [csrsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/csrsm): Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix, with solution and right-hand side stored in dense matrices. + - [gebsrmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/gebsrmm/): Showcases a sparse matrix-matrix multiplication using GEBSR storage format. + - [gemmi](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/gemmi/): Showcases a dense matrix-sparse matrix multiplication using CSR storage format. + - [sddmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/sddmm/): Showcases a sampled dense-dense matrix multiplication using CSR storage format. + - [spmm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/spmm/): Showcases a sparse matrix-dense matrix multiplication. + - [spsm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/level_3/spsm/): Showcases a sparse triangular linear system solver using CSR storage format. + - [preconditioner](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/): Manipulations on sparse matrices to obtain sparse preconditioner matrices. + - [bsric0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/bsric0/): Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse BSR matrix. + - [bsrilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/bsrilu0/): Showcases how to obtain the incomplete LU decomposition of a sparse BSR square matrix. + - [csric0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csric0/): Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse CSR matrix. + - [csrilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csrilu0/): Showcases how to obtain the incomplete LU decomposition of a sparse CSR square matrix. + - [csritilu0](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocSPARSE/preconditioner/csritilu0/): Showcases how to iteratively obtain the incomplete LU decomposition of a sparse CSR square matrix.
+ - [rocThrust](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/) + - [device_ptr](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/device_ptr/): Simple program that showcases the usage of the `thrust::device_ptr` template. + - [norm](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/norm/): An example that computes the Euclidean norm of a `thrust::device_vector`. + - [reduce_sum](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/reduce_sum/): An example that computes the sum of a `thrust::device_vector` integer vector using the `thrust::reduce()` generalized summation and the `thrust::plus` operator. + - [remove_points](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/remove_points/): Simple program that demonstrates the usage of the `thrust` random number generation, host vector, generation, tuple, zip iterator, and conditional removal templates. It generates a number of random points in a unit square and then removes all points that fall outside the unit circle. + - [saxpy](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/saxpy/): Simple program that implements the SAXPY operation (`y[i] = a * x[i] + y[i]`) using rocThrust and showcases the usage of the vector and functor templates and of `thrust::fill` and `thrust::transform` operations. + - [vectors](https://github.com/ROCm/rocm-examples/tree/develop/Libraries/rocThrust/vectors/): Simple program that showcases the `host_vector` and the `device_vector` of rocThrust. ## Prerequisites + ### Linux + - [CMake](https://cmake.org/download/) (at least version 3.21) - A number of examples also support building via GNU Make - available through the distribution's package manager - [ROCm](https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.1.3/page/Overview_of_ROCm_Installation_Methods.html) (at least version 5.x.x) - For example-specific prerequisites, see the example subdirectories. ### Windows + - [Visual Studio](https://visualstudio.microsoft.com/) 2019 or 2022 with the "Desktop Development with C++" workload - ROCm toolchain for Windows (No public release yet) - The Visual Studio ROCm extension needs to be installed to build with the solution files. @@ -139,11 +143,15 @@ - [Ninja](https://ninja-build.org/) (optional, to build with CMake) ## Building the example suite + ### Linux + These instructions assume that the prerequisites for every example are installed on the system. #### CMake + See [CMake build options](#cmake-build-options) for an overview of build options. + - `$ git clone https://github.com/ROCm/rocm-examples.git` - `$ cd rocm-examples` - `$ cmake -S . -B build` (on ROCm) or `$ cmake -S . -B build -D GPU_RUNTIME=CUDA` (on CUDA) @@ -151,15 +159,19 @@ See [CMake build options](#cmake-build-options) for an overview of build options - `$ cmake --install build --prefix install` #### Make + Beware that only a subset of the examples support building via Make. + - `$ git clone https://github.com/ROCm/rocm-examples.git` - `$ cd rocm-examples` - `$ make` (on ROCm) or `$ make GPU_RUNTIME=CUDA` (on CUDA) ### Linux with Docker + Alternatively, instead of installing the prerequisites on the system, the [Dockerfiles](https://github.com/ROCm/rocm-examples/tree/develop/Dockerfiles/) in this repository can be used to build images that provide all required prerequisites.
Note that the ROCm kernel GPU driver still needs to be installed on the host system. The following instructions showcase building the Docker image and the full example suite inside the container using CMake: + - `$ git clone https://github.com/ROCm/rocm-examples.git` - `$ cd rocm-examples/Dockerfiles` - `$ docker build . -t rocm-examples -f hip-libraries-rocm-ubuntu.Dockerfile` (on ROCm) or `$ docker build . -t rocm-examples -f hip-libraries-cuda-ubuntu.Dockerfile` (on CUDA) @@ -170,11 +182,15 @@ The following instructions showcase building the Docker image and full example s - `# cmake --build build` The built executables can be found and run in the `build` directory: + - `# ./build/Libraries/rocRAND/simple_distributions_cpp/simple_distributions_cpp` ### Windows + #### Visual Studio + The repository has Visual Studio project files for all examples together and for each example individually. + - Project files for Visual Studio are named after the example with a `_vs` suffix added, e.g. `device_sum_vs2019.sln` for the device sum example. - The project files can be built from Visual Studio or from the command line using MSBuild. - Use the build solution command in Visual Studio to build. @@ -185,6 +201,7 @@ The repository has Visual Studio project files for all examples and individually - The top level solution files come in two flavors: `ROCm-Examples-VS.sln` and `ROCm-Examples-Portable-VS.sln`. The former contains all examples, while the latter contains the examples that support both ROCm and CUDA. #### CMake + First, clone the repository and go to the source directory.

```shell
git clone https://github.com/ROCm/rocm-examples.git
cd rocm-examples
```

There are two ways to build the project using CMake: with the Visual Studio Developer Command Prompt (recommended) or with a standard Command Prompt. See [CMake build options](#cmake-build-options) for an overview of build options. ##### Visual Studio Developer Command Prompt + Select Start, search for "x64 Native Tools Command Prompt for VS 2019", and open the resulting Command Prompt. Ninja must be selected as the generator, and Clang as the C++ compiler.

```shell
cmake --build build
```

##### Standard Command Prompt + Run the standard Command Prompt. When using the standard Command Prompt to build the project, the Resource Compiler (RC) path must be specified. The RC is a tool used to build Windows-based applications; its default path is `C:/Program Files (x86)/Windows Kits/10/bin//x64/rc.exe`. Finally, the generator must be set to Ninja.

```shell
cmake --build build
```

### CMake build options + The following options are available when building with CMake. | Option | Relevant to | Default value | Description | |:---------------------------|:------------|:-----------------|:--------------------------------------------------------------------------------------------------------|