diff --git a/README.md b/README.md
index 68e4d3bb8..4f8dc965a 100644
--- a/README.md
+++ b/README.md
@@ -154,7 +154,7 @@ Header-based and backend-independent Device API can be called within ```sycl ker
Supported domains include: BLAS, LAPACK, RNG, DFT, SPARSE_BLAS
Supported compilers include:
-- [Intel(R) oneAPI DPC++ Compiler](https://software.intel.com/en-us/oneapi/dpc-compiler): Intel proprietary compiler that supports CPUs and Intel GPUs. Intel(R) oneAPI DPC++ Compiler will be referred to as "Intel DPC++" in the "Supported Compiler" column of the tables below.
+- [Intel(R) oneAPI DPC++ Compiler](https://software.intel.com/en-us/oneapi/dpc-compiler): Intel proprietary compiler that supports CPUs and Intel GPUs.
- [oneAPI DPC++ Compiler](https://github.com/intel/llvm): Open source compiler that supports CPUs and Intel, NVIDIA, and AMD GPUs. oneAPI DPC++ Compiler will be referred to as "Open DPC++" in the "Supported Compiler" column of the tables below.
- [AdaptiveCpp Compiler](https://github.com/AdaptiveCpp/AdaptiveCpp) (formerly known as hipSYCL): Open source compiler that supports CPUs and Intel, NVIDIA, and AMD GPUs. **Note**: The source code and some documents in this project still use the previous name hipSYCL during this transition period.
@@ -175,28 +175,28 @@ Supported compilers include:
BLAS |
x86 CPU |
Intel(R) oneMKL |
- Intel DPC++, AdaptiveCpp |
+ Intel(R) oneAPI DPC++ Compiler, AdaptiveCpp |
Dynamic, Static |
NETLIB LAPACK |
- Intel DPC++, Open DPC++, AdaptiveCpp |
+ Intel(R) oneAPI DPC++ Compiler, Open DPC++, AdaptiveCpp |
Dynamic, Static |
portBLAS |
- Intel DPC++, Open DPC++ |
+ Intel(R) oneAPI DPC++ Compiler, Open DPC++ |
Dynamic, Static |
Intel GPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
portBLAS |
- Intel DPC++, Open DPC++ |
+ Intel(R) oneAPI DPC++ Compiler, Open DPC++ |
Dynamic, Static |
@@ -225,13 +225,13 @@ Supported compilers include:
LAPACK |
x86 CPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
Intel GPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
@@ -250,13 +250,13 @@ Supported compilers include:
RNG |
x86 CPU |
Intel(R) oneMKL |
- Intel DPC++, AdaptiveCpp |
+ Intel(R) oneAPI DPC++ Compiler, AdaptiveCpp |
Dynamic, Static |
Intel GPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
@@ -275,23 +275,23 @@ Supported compilers include:
DFT |
x86 CPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
portFFT (limited API support) |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
Intel GPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
portFFT (limited API support) |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
@@ -320,13 +320,13 @@ Supported compilers include:
SPARSE_BLAS |
x86 CPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
Intel GPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
@@ -349,44 +349,44 @@ Supported compilers include:
BLAS |
x86 CPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
NETLIB LAPACK |
- Intel DPC++, Open DPC++ |
+ Intel(R) oneAPI DPC++ Compiler, Open DPC++ |
Dynamic, Static |
Intel GPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
LAPACK |
x86 CPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
Intel GPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
RNG |
x86 CPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
Intel GPU |
Intel(R) oneMKL |
- Intel DPC++ |
+ Intel(R) oneAPI DPC++ Compiler |
Dynamic, Static |
diff --git a/examples/README.md b/examples/README.md
index 9904a78f2..bb5b6ca16 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -17,582 +17,25 @@ The example executable naming convention follows `example_<$domain>_<$routine>_<
or `example_<$domain>_<$routine>` for run-time dispatching examples.
E.g. `example_blas_gemm_usm_mklcpu_cublas` and `example_blas_gemm_usm`
-## Example outputs (blas, rng, lapack, dft, sparse_blas)
+## Running examples
## blas
+The following shows how to run examples with different backends, using the BLAS domain as an illustration.
+
Run-time dispatching examples with mklcpu backend
```
$ export ONEAPI_DEVICE_SELECTOR="opencl:cpu"
$ ./bin/example_blas_gemm_usm
-
-########################################################################
-# General Matrix-Matrix Multiplication using Unified Shared Memory Example:
-#
-# C = alpha * A * B + beta * C
-#
-# where A, B and C are general dense matrices and alpha, beta are
-# floating point type precision scalars.
-#
-# Using apis:
-# gemm
-#
-# Using single precision (float) data type
-#
-# Device will be selected during runtime.
-# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
-# available devices
-#
-########################################################################
-
-Running BLAS GEMM USM example on CPU device.
-Device name is: Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
-Running with single precision real data type:
-
- GEMM parameters:
- transA = trans, transB = nontrans
- m = 45, n = 98, k = 67
- lda = 103, ldB = 105, ldC = 106
- alpha = 2, beta = 3
-
- Outputting 2x2 block of A,B,C matrices:
-
- A = [ 0.340188, 0.260249, ...
- [ -0.105617, 0.0125354, ...
- [ ...
-
-
- B = [ -0.326421, -0.192968, ...
- [ 0.363891, 0.251295, ...
- [ ...
-
-
- C = [ 0.00698781, 0.525862, ...
- [ 0.585167, 1.59017, ...
- [ ...
-
-BLAS GEMM USM example ran OK.
-
```
Run-time dispatching examples with mklgpu backend
```
$ export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
$ ./bin/example_blas_gemm_usm
-
-########################################################################
-# General Matrix-Matrix Multiplication using Unified Shared Memory Example:
-#
-# C = alpha * A * B + beta * C
-#
-# where A, B and C are general dense matrices and alpha, beta are
-# floating point type precision scalars.
-#
-# Using apis:
-# gemm
-#
-# Using single precision (float) data type
-#
-# Device will be selected during runtime.
-# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
-# available devices
-#
-########################################################################
-
-Running BLAS GEMM USM example on GPU device.
-Device name is: Intel(R) Iris(R) Pro Graphics 580 [0x193b]
-Running with single precision real data type:
-
- GEMM parameters:
- transA = trans, transB = nontrans
- m = 45, n = 98, k = 67
- lda = 103, ldB = 105, ldC = 106
- alpha = 2, beta = 3
-
- Outputting 2x2 block of A,B,C matrices:
-
- A = [ 0.340188, 0.260249, ...
- [ -0.105617, 0.0125354, ...
- [ ...
-
-
- B = [ -0.326421, -0.192968, ...
- [ 0.363891, 0.251295, ...
- [ ...
-
-
- C = [ 0.00698781, 0.525862, ...
- [ 0.585167, 1.59017, ...
- [ ...
-
-BLAS GEMM USM example ran OK.
```
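+
+The executables above are built from the sources in `examples/`. As a rough, illustrative sketch of what such a run-time dispatching application looks like (assuming the oneMKL headers and run-time library are on the include and link paths; sizes, data, and variable names below are placeholders), a GEMM call dispatched from the queue's device might be:
+
+```cpp
+#include <algorithm>
+#include <cstdint>
+#include <sycl/sycl.hpp>
+#include "oneapi/mkl.hpp"
+
+int main() {
+    // The device (and therefore the backend) is picked at run time,
+    // e.g. restricted via ONEAPI_DEVICE_SELECTOR as in the commands above.
+    sycl::queue q{sycl::default_selector_v};
+
+    const std::int64_t m = 2, n = 3, k = 4;
+    float *a = sycl::malloc_shared<float>(m * k, q);
+    float *b = sycl::malloc_shared<float>(k * n, q);
+    float *c = sycl::malloc_shared<float>(m * n, q);
+    std::fill_n(a, m * k, 1.0f);
+    std::fill_n(b, k * n, 1.0f);
+    std::fill_n(c, m * n, 0.0f);
+
+    // Run-time dispatch: the backend library is selected from q's device.
+    oneapi::mkl::blas::column_major::gemm(
+        q, oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
+        m, n, k, 1.0f, a, m, b, k, 0.0f, c, m).wait();
+
+    sycl::free(a, q);
+    sycl::free(b, q);
+    sycl::free(c, q);
+    return 0;
+}
+```
+
+Such a program is typically compiled with one of the supported DPC++ compilers and linked against the oneMKL Interfaces run-time library for dynamic dispatch.
+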
Compile-time dispatching example with both the mklcpu and cublas backends
(Note that the mklcpu and cublas result matrices differ slightly. This is expected due to the limited precision of `float`.)
```
./bin/example_blas_gemm_usm_mklcpu_cublas
-
-########################################################################
-# General Matrix-Matrix Multiplication using Unified Shared Memory Example:
-#
-# C = alpha * A * B + beta * C
-#
-# where A, B and C are general dense matrices and alpha, beta are
-# floating point type precision scalars.
-#
-# Using apis:
-# gemm
-#
-# Using single precision (float) data type
-#
-# Running on both Intel CPU and Nvidia GPU devices
-#
-########################################################################
-
-Running BLAS GEMM USM example
-Running with single precision real data type on:
- CPU device: Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
- GPU device: TITAN RTX
-
- GEMM parameters:
- transA = trans, transB = nontrans
- m = 45, n = 98, k = 67
- lda = 103, ldB = 105, ldC = 106
- alpha = 2, beta = 3
-
- Outputting 2x2 block of A,B,C matrices:
-
- A = [ 0.340188, 0.260249, ...
- [ -0.105617, 0.0125354, ...
- [ ...
-
-
- B = [ -0.326421, -0.192968, ...
- [ 0.363891, 0.251295, ...
- [ ...
-
-
- (CPU) C = [ 0.00698781, 0.525862, ...
- [ 0.585167, 1.59017, ...
- [ ...
-
-
- (GPU) C = [ 0.00698793, 0.525862, ...
- [ 0.585168, 1.59017, ...
- [ ...
-
-BLAS GEMM USM example ran OK on MKLCPU and CUBLAS
-
-```
-
-## lapack
-Run-time dispatching example with mklgpu backend:
-```
-$ export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
-$ ./bin/example_lapack_getrs_usm
-
-########################################################################
-# LU Factorization and Solve Example:
-#
-# Computes LU Factorization A = P * L * U
-# and uses it to solve for X in a system of linear equations:
-# AX = B
-# where A is a general dense matrix and B is a matrix whose columns
-# are the right-hand sides for the systems of equations.
-#
-# Using apis:
-# getrf and getrs
-#
-# Using single precision (float) data type
-#
-# Device will be selected during runtime.
-# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
-# available devices
-#
-########################################################################
-
-Running LAPACK getrs example on GPU device.
-Device name is: Intel(R) Iris(R) Pro Graphics 580 [0x193b]
-Running with single precision real data type:
-
- GETRF and GETRS parameters:
- trans = nontrans
- m = 23, n = 23, nrhs = 23
- lda = 32, ldb = 32
-
- Outputting 2x2 block of A and X matrices:
-
- A = [ 0.340188, 0.304177, ...
- [ -0.105617, -0.343321, ...
- [ ...
-
-
- X = [ -1.1748, 1.84793, ...
- [ 1.47856, 0.189481, ...
- [ ...
-
-LAPACK GETRS USM example ran OK
-```
-
-Compile-time dispatching example with both mklcpu and cusolver backend
-```
-$ ./bin/example_lapack_getrs_usm_mklcpu_cusolver
-
-########################################################################
-# LU Factorization and Solve Example:
-#
-# Computes LU Factorization A = P * L * U
-# and uses it to solve for X in a system of linear equations:
-# AX = B
-# where A is a general dense matrix and B is a matrix whose columns
-# are the right-hand sides for the systems of equations.
-#
-# Using apis:
-# getrf and getrs
-#
-# Using single precision (float) data type
-#
-# Running on both Intel CPU and NVIDIA GPU devices
-#
-########################################################################
-
-Running LAPACK GETRS USM example
-Running with single precision real data type on:
- CPU device :Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
- GPU device :TITAN RTX
-
- GETRF and GETRS parameters:
- trans = nontrans
- m = 23, n = 23, nrhs = 23
- lda = 32, ldb = 32
-
- Outputting 2x2 block of A,B,X matrices:
-
- A = [ 0.340188, 0.304177, ...
- [ -0.105617, -0.343321, ...
- [ ...
-
-
- (CPU) X = [ -1.1748, 1.84793, ...
- [ 1.47856, 0.189481, ...
- [ ...
-
-
- (GPU) X = [ -1.1748, 1.84793, ...
- [ 1.47856, 0.189481, ...
- [ ...
-
-LAPACK GETRS USM example ran OK on MKLCPU and CUSOLVER
-
-```
-
-## rng
-Run-time dispatching example with mklgpu backend:
-```
-$ export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
-$ ./bin/example_rng_uniform_usm
-
-########################################################################
-# Generate uniformly distributed random numbers with philox4x32x10
-# generator example:
-#
-# Using APIs:
-# default_engine uniform
-#
-# Using single precision (float) data type
-#
-# Device will be selected during runtime.
-# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
-# available devices
-#
-########################################################################
-
-Running RNG uniform usm example on GPU device
-Device name is: Intel(R) Iris(R) Pro Graphics 580 [0x193b]
-Running with single precision real data type:
- generation parameters:
- seed = 777, a = 0, b = 10
- Output of generator:
- first 10 numbers of 1000:
-8.52971 1.76033 6.04753 3.68079 9.04039 2.61014 3.75788 3.94859 7.93444 8.60436
-Random number generator with uniform distribution ran OK
-
-```
-
-Compile-time dispatching example with both mklcpu and curand backend
-```
-$ ./bin/example_rng_uniform_usm_mklcpu_curand
-
-########################################################################
-# Generate uniformly distributed random numbers with philox4x32x10
-# generator example:
-#
-# Using APIs:
-# default_engine uniform
-#
-# Using single precision (float) data type
-#
-# Running on both Intel CPU and Nvidia GPU devices
-#
-########################################################################
-
-Running RNG uniform usm example
-Running with single precision real data type:
- CPU device: Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
- GPU device: TITAN RTX
- generation parameters:
- seed = 777, a = 0, b = 10
- Output of generator on CPU device:
- first 10 numbers of 1000:
-8.52971 1.76033 6.04753 3.68079 9.04039 2.61014 3.75788 3.94859 7.93444 8.60436
- Output of generator on GPU device:
- first 10 numbers of 1000:
-3.52971 6.76033 1.04753 8.68079 4.48229 0.501966 6.78265 8.99091 6.39516 9.67955
-Random number generator example with uniform distribution ran OK on MKLCPU and CURAND
-
-```
-
-## dft
-
-Compile-time dispatching example with MKLGPU backend
-
-```none
-$ ONEAPI_DEVICE_SELECTOR="level_zero:gpu" ./bin/example_dft_complex_fwd_buffer_mklgpu
-
-########################################################################
-# Complex out-of-place forward transform for Buffer API's example:
-#
-# Using APIs:
-# Compile-time dispatch API
-# Buffer forward complex out-of-place
-#
-# Using single precision (float) data type
-#
-# For Intel GPU with Intel MKLGPU backend.
-#
-# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
-# available devices
-########################################################################
-
-Running DFT Complex forward out-of-place buffer example
-Using compile-time dispatch API with MKLGPU.
-Running with single precision real data type on:
- GPU device :Intel(R) UHD Graphics 750 [0x4c8a]
-DFT Complex USM example ran OK on MKLGPU
-```
-
-Runtime dispatching example with MKLGPU, cuFFT, rocFFT and portFFT backends:
-
-```none
-$ ONEAPI_DEVICE_SELECTOR="level_zero:gpu" ./bin/example_dft_real_fwd_usm
-
-########################################################################
-# DFT complex in-place forward transform with USM API example:
-#
-# Using APIs:
-# USM forward complex in-place
-# Run-time dispatch
-#
-# Using single precision (float) data type
-#
-# Device will be selected during runtime.
-# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
-# available devices
-#
-########################################################################
-
-Running DFT complex forward example on GPU device
-Device name is: Intel(R) UHD Graphics 750 [0x4c8a]
-Running with single precision real data type:
-DFT example run_time dispatch
-DFT example ran OK
-```
-
-```none
-$ ONEAPI_DEVICE_SELECTOR="level_zero:gpu" ./bin/example_dft_real_fwd_usm
-
-########################################################################
-# DFT complex in-place forward transform with USM API example:
-#
-# Using APIs:
-# USM forward complex in-place
-# Run-time dispatch
-#
-# Using single precision (float) data type
-#
-# Device will be selected during runtime.
-# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
-# available devices
-#
-########################################################################
-
-Running DFT complex forward example on GPU device
-Device name is: NVIDIA A100-PCIE-40GB
-Running with single precision real data type:
-DFT example run_time dispatch
-DFT example ran OK
-```
-
-```none
-$ ./bin/example_dft_real_fwd_usm
-
-########################################################################
-# DFT complex in-place forward transform with USM API example:
-#
-# Using APIs:
-# USM forward complex in-place
-# Run-time dispatch
-#
-# Using single precision (float) data type
-#
-# Device will be selected during runtime.
-# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
-# available devices
-#
-########################################################################
-
-Running DFT complex forward example on GPU device
-Device name is: AMD Radeon PRO W6800
-Running with single precision real data type:
-DFT example run_time dispatch
-DFT example ran OK
-```
-
-```none
-$ LD_LIBRARY_PATH=lib/:$LD_LIBRARY_PATH ./bin/example_dft_real_fwd_usm
-########################################################################
-# DFT complex in-place forward transform with USM API example:
-#
-# Using APIs:
-# USM forward complex in-place
-# Run-time dispatch
-#
-# Using single precision (float) data type
-#
-# Device will be selected during runtime.
-# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
-# available devices
-#
-########################################################################
-
-Running DFT complex forward example on GPU device
-Device name is: Intel(R) UHD Graphics 750
-Running with single precision real data type:
-DFT example run_time dispatch
-Unsupported Configuration:
- oneMKL: dft/backends/portfft/commit: function is not implemented portFFT only supports complex to complex transforms
-```
-
-## sparse_blas
-
-Run-time dispatching examples with mklcpu backend
-```
-$ export ONEAPI_DEVICE_SELECTOR="opencl:cpu"
-$ ./bin/example_sparse_blas_gemv_usm
-
-########################################################################
-# Sparse Matrix-Vector Multiply Example:
-#
-# y = alpha * op(A) * x + beta * y
-#
-# where A is a sparse matrix in CSR format, x and y are dense vectors
-# and alpha, beta are floating point type precision scalars.
-#
-# Using apis:
-# sparse::gemv
-#
-# Using single precision (float) data type
-#
-# Device will be selected during runtime.
-# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
-# available devices
-#
-########################################################################
-
-Running Sparse BLAS GEMV USM example on CPU device.
-Device name is: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
-Running with single precision real data type:
-
- sparse::gemv parameters:
- transA = nontrans
- nrows = 64
- alpha = 1, beta = 0
-
- sparse::gemv example passed
- Finished
-Sparse BLAS GEMV USM example ran OK.
-```
-
-Run-time dispatching examples with mklgpu backend
-```
-$ export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
-$ ./bin/example_sparse_blas_gemv_usm
-
-########################################################################
-# Sparse Matrix-Vector Multiply Example:
-#
-# y = alpha * op(A) * x + beta * y
-#
-# where A is a sparse matrix in CSR format, x and y are dense vectors
-# and alpha, beta are floating point type precision scalars.
-#
-# Using apis:
-# sparse::gemv
-#
-# Using single precision (float) data type
-#
-# Device will be selected during runtime.
-# The environment variable ONEAPI_DEVICE_SELECTOR can be used to specify
-# available devices
-#
-########################################################################
-
-Running Sparse BLAS GEMV USM example on GPU device.
-Device name is: Intel(R) HD Graphics 530 [0x1912]
-Running with single precision real data type:
-
- sparse::gemv parameters:
- transA = nontrans
- nrows = 64
- alpha = 1, beta = 0
-
- sparse::gemv example passed
- Finished
-Sparse BLAS GEMV USM example ran OK.
-```
-
-Compile-time dispatching example with mklcpu backend
-```
-$ export ONEAPI_DEVICE_SELECTOR="opencl:cpu"
-$ ./bin/example_sparse_blas_gemv_usm_mklcpu
-
-########################################################################
-# Sparse Matrix-Vector Multiply Example:
-#
-# y = alpha * op(A) * x + beta * y
-#
-# where A is a sparse matrix in CSR format, x and y are dense vectors
-# and alpha, beta are floating point type precision scalars.
-#
-# Using apis:
-# sparse::gemv
-#
-# Using single precision (float) data type
-#
-# Running on Intel CPU device
-#
-########################################################################
-
-Running Sparse BLAS GEMV USM example on CPU device.
-Device name is: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
-Running with single precision real data type:
-
- sparse::gemv parameters:
- transA = nontrans
- nrows = 64
- alpha = 1, beta = 0
-
- sparse::gemv example passed
- Finished
-Sparse BLAS GEMV USM example ran OK.
```
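+
+For comparison with the run-time sketch above, compile-time dispatching pins the backend through a `backend_selector` wrapped around the queue, so no backend lookup happens at run time. A minimal, illustrative sketch for the mklcpu backend (assuming that backend library is built and linked; sizes and variable names are placeholders) might be:
+
+```cpp
+#include <algorithm>
+#include <cstdint>
+#include <sycl/sycl.hpp>
+#include "oneapi/mkl.hpp"
+
+int main() {
+    sycl::queue cpu_q{sycl::cpu_selector_v};
+
+    const std::int64_t n = 4; // square m = n = k case for brevity
+    float *a = sycl::malloc_shared<float>(n * n, cpu_q);
+    float *b = sycl::malloc_shared<float>(n * n, cpu_q);
+    float *c = sycl::malloc_shared<float>(n * n, cpu_q);
+    std::fill_n(a, n * n, 1.0f);
+    std::fill_n(b, n * n, 1.0f);
+    std::fill_n(c, n * n, 0.0f);
+
+    // Compile-time dispatch: the mklcpu backend is fixed by the selector type,
+    // so the call resolves directly to that backend library.
+    oneapi::mkl::backend_selector<oneapi::mkl::backend::mklcpu> on_cpu{cpu_q};
+    oneapi::mkl::blas::column_major::gemm(
+        on_cpu, oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
+        n, n, n, 1.0f, a, n, b, n, 0.0f, c, n).wait();
+
+    sycl::free(a, cpu_q);
+    sycl::free(b, cpu_q);
+    sycl::free(c, cpu_q);
+    return 0;
+}
+```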