
The goal of my project is to determine the best and easiest way to obtain maximum performance for stenciled functions on both CPUs and GPUs using Kokkos. Explicit methods such as finite difference and finite volume for MHD/hydrodynamics on structured meshes can typically be decomposed into a series of stenciled functions applied to the mesh. Rather than write an entire MHD code using several different approaches, I want to focus on a simple function, such as a centered finite-difference derivative. Depending on time constraints, I will implement this hypothetical ideal approach in the hydro code Kathena and compare it to a naive Kokkos implementation and a CPU-optimized Athena version.
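As a concrete reference point for the kind of function meant here, the following is a minimal plain C++ sketch of a second-order centered derivative in X on a structured 3D mesh. The `Grid` type, the `centered_dx` name, and the flat X-fastest memory layout are assumptions made for illustration; they are not taken from Kathena, Athena, or this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat 3D grid with x as the fastest-varying index.
struct Grid {
  std::size_t nx, ny, nz;
  std::vector<double> data;

  Grid(std::size_t nx_, std::size_t ny_, std::size_t nz_)
      : nx(nx_), ny(ny_), nz(nz_), data(nx_ * ny_ * nz_, 0.0) {}

  double& at(std::size_t i, std::size_t j, std::size_t k) {
    return data[(k * ny + j) * nx + i];
  }
  double at(std::size_t i, std::size_t j, std::size_t k) const {
    return data[(k * ny + j) * nx + i];
  }
};

// Centered derivative in x: du[i] = (u[i+1] - u[i-1]) / (2 dx).
// Only interior points in x are written; the first and last x cells
// are left untouched and would be handled by boundary code.
void centered_dx(const Grid& u, Grid& du, double dx) {
  assert(u.nx == du.nx && u.ny == du.ny && u.nz == du.nz);
  const double inv_2dx = 1.0 / (2.0 * dx);
  for (std::size_t k = 0; k < u.nz; ++k) {
    for (std::size_t j = 0; j < u.ny; ++j) {
      for (std::size_t i = 1; i + 1 < u.nx; ++i) {
        du.at(i, j, k) = (u.at(i + 1, j, k) - u.at(i - 1, j, k)) * inv_2dx;
      }
    }
  }
}
```

The innermost loop runs over the contiguous index, which is the usual starting point for CPU vectorization; the same loop body is what a Kokkos or CUDA kernel would evaluate per mesh point.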

Basic types of functions I want to test:

Stencil-less operations

Centered 1D stencil operations, in X, Y, and Z.

Small and Large Stencils

High and Low arithmetic intensities

High and Low register use (simple and complex functions)
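One way to cover several of the cases above with a single test function is to make the stencil radius a compile-time parameter. The sketch below is a hypothetical 1D version in plain C++ (the name `apply_stencil_1d` and the weight layout are assumptions for illustration): a radius of 0 gives a stencil-less pointwise operation, while larger radii give small and large centered stencils.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Apply a centered 1D stencil of radius R with weights w[0..2R],
// where w[R] multiplies the center point.
// Only points with a full stencil (R <= i < n - R) are written.
template <std::size_t R>
void apply_stencil_1d(const std::vector<double>& in,
                      std::vector<double>& out,
                      const std::array<double, 2 * R + 1>& w) {
  assert(in.size() == out.size());
  for (std::size_t i = R; i + R < in.size(); ++i) {
    double acc = 0.0;
    for (std::size_t s = 0; s <= 2 * R; ++s) {
      acc += w[s] * in[i + s - R];  // i >= R, so the index never underflows
    }
    out[i] = acc;
  }
}
```

As written, each output point costs about 2(2R+1) floating-point operations, so increasing R is one simple knob for the work done per point, although the measured arithmetic intensity will also depend on how much of the input is reused from cache.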

The main questions I want answered are:

What is the fastest CPU implementation?

What is the fastest Kokkos implementation for CPUs?

What is the fastest Kokkos implementation for GPUs?

What is the fastest Kokkos implementation for both CPUs and GPUs?

What is the fastest CUDA implementation for GPUs?

What is the easiest Kokkos implementation that is still fast on both CPUs and GPUs?

How do all of these implementations compare?

Specific approaches I want to try/Questions/Random Notes

Varying problem size?

Varying register use?

How to get SIMD computations on the CPU?

Can Kokkos give the same performance as CUDA?

Does UVM incur a performance penalty? Even if the data stays on the GPU? In CUDA? In Kokkos? How big a performance penalty is there if data is transferred on and off the GPU every time step?

How do you best utilize the explicit caches? On GPUs? Can you do this with Kokkos? On CPUs with SIMD?

Do aligned data accesses matter? On the GPU? On the CPU?

What if boundaries are computed in a separate function?
- On the GPU?
- On the CPU? (with and without UVM)

What if boundaries are transferred via MPI?
- Between CUDA addresses?
- With UVM?
- On the same node? On separate nodes?
- With cudaMemAdvise? Is it available in Kokkos too?

How does high register use change these answers? (e.g., in reconstruction)

Can OpenACC give the same performance as Kokkos? As CUDA?

Do these same considerations apply to FPGAs?
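To make the question above about computing boundaries in a separate function concrete, here is a minimal 1D sketch in plain C++. It assumes periodic boundaries purely for illustration, and the function names are hypothetical. The interior loop has no index wrapping or branching, and the two edge points are filled in by a second function that could be timed or scheduled separately.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interior points only: no wrapping and no branches in the loop body.
void derivative_interior(const std::vector<double>& u,
                         std::vector<double>& du, double dx) {
  assert(u.size() == du.size());
  const double inv_2dx = 1.0 / (2.0 * dx);
  for (std::size_t i = 1; i + 1 < u.size(); ++i) {
    du[i] = (u[i + 1] - u[i - 1]) * inv_2dx;
  }
}

// Boundary points only, assuming a periodic domain (an illustrative choice).
void derivative_boundary_periodic(const std::vector<double>& u,
                                  std::vector<double>& du, double dx) {
  assert(u.size() == du.size() && u.size() >= 2);
  const double inv_2dx = 1.0 / (2.0 * dx);
  const std::size_t n = u.size();
  du[0] = (u[1] - u[n - 1]) * inv_2dx;      // left neighbor wraps to the end
  du[n - 1] = (u[0] - u[n - 2]) * inv_2dx;  // right neighbor wraps to the start
}
```

Calling both functions gives the same result as a single loop with modular indexing; whether the split is faster on a given CPU or GPU is exactly what would need to be measured.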

Repository contents (26 commits):
common
cpu
cuda
data
kokkos
notebooks
openacc
.gitignore
Makefile
README.md


forrestglines/stencil_optimizations
