-
Notifications
You must be signed in to change notification settings - Fork 99
BLAS
Kokkos Kernels strive to make available most BLAS capabilities to users and even more through extensions. Our native implementations using the Kokkos Core library capabilities are providing portability and performance which allows users to explore the performance of early hardware where no other implementation is available. Additional we support a large third party layer that wraps specialized and high performance implementation of the BLAS specification.
Our implementations are heavily leveraging the capabilities of Kokkos::View
to simplify the interface of the standard BLAS specification. A Kokkos::View
should be thought of as a tensor, depending on its rank
it can represent a scalar value, a vector or a matrix. Each rank has an associated extent
that specifies the number of values stored by the rank and a stride
that records the leading dimension of the rank. Having these encoded in the View means that Kokkos Kernels does not require the user to specify dimension of vector or matrices neither information like leading dimension of matrices. Another information carried by the View is the notion of increment used to step by a given spacing in a vector. That increment is encoded in the layout of the View using a LayoutStride
. Below we show the signature of the gemv function of standard BLAS and Kokkos Kernels to illustrate our explanations:
dgemv(character TRANS,
integer M,
integer N,
double precision ALPHA,
double precision, dimension(lda,*) A,
integer LDA,
double precision, dimension(*) X,
integer INCX,
double precision BETA,
double precision, dimension(*) Y,
integer INCY)
template <class ExecutionSpace, class AViewType, class XViewType, class YViewType>
void gemv(const ExecutionSpace& space,
const char trans[],
typename AViewType::const_value_type& alpha,
const AViewType& A,
const XViewType& x,
typename YViewType::const_value_type& beta,
const YViewType& y)
Now we provide an example of LayoutStride usage to compute the nrm1 of the odd entries and even entries in a vector, of course the example can be adapted for larger increments.
#include "Kokkos_Core.hpp"
#include "KokkosBlas1_nrm1.hpp"
int main(int argc, char* argv[]) {
Kokkos::initialize(argc, argv);
{
using vector = Kokkos::View<double*, Kokkos::DefaultExecutionSpace>;
// Create a vector of length 20.
// Fill it with the value 2.
// Compute its norm1: 20*2 = 40.
vector a("a", 20);
Kokkos::deep_copy(a, 2);
auto nrm1_a = KokkosBlas::nrm1(a);
// Create a strided layout of length 10 and stride 2
// Make a new view using the data from vector a and the new layout
// Set the values in the even vector to 1
// Compute the norm1 of the even vector: 10*1 = 10
Kokkos::LayoutStride layout(10, 2);
Kokkos::View<double*, Kokkos::LayoutStride, Kokkos::DefaultExecutionSpace> even_a(a.data(), layout);
Kokkos::deep_copy(even_a, 1);
auto nrm1_even_a = KokkosBlas::nrm1(even_a);
// Repeat the operations to create an odd vector using a pointer offset and the layout
// Set the values of the odd vector to 3
// Compute the norm1 of the odd vector: 10*3 = 30
Kokkos::View<double*, Kokkos::LayoutStride, Kokkos::DefaultExecutionSpace> odd_a(a.data()+1, layout);
Kokkos::deep_copy(odd_a, 3);
auto nrm1_odd_a = KokkosBlas::nrm1(odd_a);
// Print all the norms out and the values in the original vector.
// Note that to do this with device views you will need to create
// mirror views and copy data from device to host with deep_copy.
std::cout << "nrm1(a)=" << nrm1_a << ", nrm1(even_a)=" << nrm1_even_a << ", nrm1(odd_a)=" << nrm1_odd_a << std::endl;
std::cout << "a={ ";
for(int idx = 0; idx < a.extent_int(0); ++idx) {
std::cout << a(idx) << " ";
}
std::cout << "}" << std::endl;
}
}
Kokkos::finalize();
}
Kokkos Kernels relies on template parameters to specify the type of input and output variables, this allows us to have generic implementation that are not precision specific. This is reflected in the name of our functions that do not include the usual S
, D
, C
and D
letters that identify the parameters type in reference BLAS. While this makes our naming scheme a little simpler, the use of template parameters allows us to naturally support multi-precision operations. For instance the gemv
function above can be called with vectors of double
and a matrix of float
.
We should make a rule as to which precision will be used for accumulation and stick to it in our implementation. We have at least two possibilities in my mind: use the precision of the output View or use the highest precision of all input parameters and downcast if appropriate when storing the result.
All Kokkos Kernels BLAS calls can be made with an execution space instance as the first parameter. This allows users to attach a stream/queue to the space instance which will be used to launch the BLAS kernel in the appropriate stream/queue. For this strategy to be useful, the BLAS functions are implemented in a non-blocking fashion except when a return value is expected, for instance when computing a norm or a dot product, since we need to obtain the return value before the following kernel can start. Finally while we strive to have a thread safe BLAS implementation, when using an execution space instance to set a select a particular stream/queue to execute a kernel, selecting the queue is not a thread safe operation. More detail on that can be obtained from the vendors documentation: cuBLAS.
Below are tables summarizing the currently supported function calls and third party libraries in Kokkos Kernels. The tables are updated with each release of the library to reflect recently added support.
BLAS Call | Kokkos Kernels Call | Reference | TPL BLAS | TPL CUBLAS | TPL ROCBLAS | TPL oneMKL | Complex Special |
---|---|---|---|---|---|---|---|
SROTG | rotg(a, b, c, s) | X | X | X | X | -- | -- |
SROTMG | rotmg(d1, d2, x1, y1, param) | X | X | X | X | -- | NC |
SROT | rot(X, Y, c, s) | X | X | X | X | -- | -- |
SROTM | rotm(X, Y, param) | X | X | X | -- | -- | NC |
SSWAP | swap(X, Y) | X | X | X | X | -- | N/A |
SSCAL | scal(y,a,x) | X | X | X | X | -- | N/A |
CSSCAL | scal(y,a,x) | X | X | X | X | -- | OC |
SCOPY | deep_copy(y,x) | X | -- | -- | -- | -- | N/A |
SAXPY | axpy(a,x,y) | X | X | X | -- | -- | N/A |
SDOT* | dot(x,y) | X | X | X | X | X | -- |
SDSDOT* | dot(x,y) | X | X | X | X | X | NC |
CDOTU | -- | -- | -- | -- | -- | -- | OC |
CDOTC* | dot(x,y) | X | X | X | X | X | OC |
SNRM2 | nrm2(x) | X | X | X | X | X | NC |
SCNRM2 | nrm2(x) | X | X | X | X | X | OC |
SASUM | nrm1(x) | X | X | X | X | X | N/A |
ISAMAX | iamax(x) | X | X | X | X | -- | N/A |
Kokkos Kernels Call | Reference | TPL BLAS | TPL CUBLAS | TPL ROCBLAS | TPL oneMKL | Complex Special |
---|---|---|---|---|---|---|
axpby | -- | -- | -- | -- | -- | -- |
fill | -- | -- | -- | -- | -- | -- |
mult | -- | -- | -- | -- | -- | -- |
nrminf | -- | -- | -- | -- | -- | -- |
nrm2_squared | -- | -- | -- | -- | -- | -- |
nrm2w | -- | -- | -- | -- | -- | -- |
nrm2w_squared | -- | -- | -- | -- | -- | -- |
reciprocal | -- | -- | -- | -- | -- | -- |
sum | -- | -- | -- | -- | -- | -- |
update | -- | -- | -- | -- | -- | -- |
BLAS Call | Kokkos Kernels Call | Reference | TPL BLAS | TPL CUBLAS | TPL ROCBLAS | TPL oneMKL | Complex Special |
---|---|---|---|---|---|---|---|
SGEMV | gemv(trans,a,A,x,b,y) | X | X | X | X | X | N/A |
SGBMV | -- | -- | -- | -- | -- | -- | -- |
SSYMV | -- | -- | -- | -- | -- | -- | -- |
SSBMV | -- | -- | -- | -- | -- | -- | -- |
SSPMV | -- | -- | -- | -- | -- | -- | -- |
STRMV | trmm(side,uplo,trans, diag,alpha,A,B) | X | X | X | -- | -- | -- |
STBMV | -- | -- | -- | -- | -- | -- | -- |
STPMV | -- | -- | -- | -- | -- | -- | -- |
STRSV | trsm(side,uplo,trans, diag,alpha,A,B) | X | X | X | -- | -- | -- |
STBSV | -- | -- | -- | -- | -- | -- | -- |
STPSV | -- | -- | -- | -- | -- | -- | -- |
SGER | ger(trans,a,x,y,A) | X | X | X | X | -- | NC |
CGERU | ger(trans,a,x,y,A) | X | X | X | X | -- | OC |
CGERC | ger(trans,a,x,y,A) | X | X | X | X | -- | OC |
SSYR | syr(trans,uplo,a,x,A) | X | X | X | X | -- | -- |
SSPR | -- | -- | -- | -- | -- | -- | -- |
SSYR2 | syr2(trans,uplo,a,x,y,A) | X | X | X | X | -- | -- |
SSPR2 | -- | -- | -- | -- | -- | -- | -- |
BLAS Call | Kokkos Kernels Call | Reference | TPL BLAS | TPL CUBLAS | TPL ROCBLAS | TPL oneMKL | Complex Special |
---|---|---|---|---|---|---|---|
SGEMM | gemm(transA,transB,a,A,B,b,C) | X | X | X | X | -- | N/A |
SSYMM | -- | -- | -- | -- | -- | -- | -- |
SSYRK | -- | -- | -- | -- | -- | -- | -- |
SSYR2K | -- | -- | -- | -- | -- | -- | -- |
CHEMM | -- | -- | -- | -- | -- | -- | OC |
CHERK | -- | -- | -- | -- | -- | -- | OC |
CHER2K | -- | -- | -- | -- | -- | -- | OC |
STRMM | trmm(side, uplo, trans, diag, a, A, B) | X | X | X | -- | -- | -- |
STRSM | trsm(side, uplo, trans, diag, a, A, B) | X | X | X | -- | -- | -- |
Functions which use a full execution space instance:
-
abs Element wise absolute value.
y[i] = |x[i]|
-
axpy Vector addition
y[i] = y[i] + alpha * x[i]
-
axpby Vector addition
y[i] = beta * y[i] + alpha * x[i]
-
dot Inner product.
delta = SUM_i(y[i] * x[i])
-
fill Setting each element of a vector.
x[i] = alpha
-
mult Elememt wise multiplication of two vectors
y[i] = beta * y[i] + alpha * a[i] * x[i]
-
nrm1 1-Norm of a vector.
nrm1 = SUM_i( |x[i]| )
-
nrm2 2-Norm of a vector.
nrm2 = SQRT(SUM_i( |x[i]| * |x[i]| ))
-
nrm2_squared Square of a 2-Norm of a vector.
nrm2 = SUM_i( |x[i]| * |x[i]| )
-
nrm2w Weighted 2-Norm of a vector.
nrm2 = SQRT(SUM_i( |x[i]| / |w[i]| * |x[i]| / |w[i]| ))
-
nrm2w_squared Square of a weighted 2-Norm of a vector.
nrm2 = SUM_i( |x[i]| / |w[i]| * |x[i]| / |w[i]| )
-
nrminf Inf-Norm of a vector.
nrminf = MAX_i( |x[i]| )
-
reciprocal Element wise reciprocal of a vector.
y[i] = 1/x[i]
-
scal Element wise multiplication of a vector with a scalar.
y[i] = alpha * x[i]
-
sum Sum of all elements in a vector.
sum = SUM_i( x[i] )
-
update Vector addition
z[i] = gamma * z[i] + alpha * x[i] + beta * y[i]
KokkosKernels allows calling MAGMA functions through KokkosKernels interfaces. The current supported functions are:
-
gesv Linear equation system solve using LU factorization with multiple right-hand sides
A*X = B
(partial pivoting variant and no pivoting variant).- Equivalent MAGMA functions:
magma_sgesv_gpu
,magma_dsgesv_gpu
,magma_cgesv_gpu
,magma_zgesv_gpu
,magma_sgesv_nopiv_gpu
,magma_dsgesv_nopiv_gpu
,magma_cgesv_nopiv_gpu
,magma_zgesv_nopiv_gpu
- Equivalent MAGMA functions:
-
trtri Triangular matrix inverse.
- Equivalent MAGMA functions:
magma_strtri_gpu
,magma_dtrtri_gpu
,magma_ctrtri_gpu
,magma_ztrtri_gpu
.
- Equivalent MAGMA functions:
Usage: Enable the use of TPL MAGMA by adding KOKKOSKERNELS_ENABLE_TPLS=magma
in the Makefile.