forrestglines/stencil_optimizations

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The goal of my project is to determine the best/easiest way to obtain maximum performance for stenciled functions simultaneously for CPUs and GPUs using Kokkos. Explicit methods such as finite difference and finite volume for MHD/hydrodynamics on structured meshes can typically be decomposed into a series of stenciled functions applied to the mesh. Rather than write an entire MHD code using several different approaches, I want to focus on a simple function, such as a centered finite derivative. Depending on time constraints, I will implement this hypothetical ideal approach in the hydro code Kathena and compare it to a naive Kokkos implementation and a CPU-optimized Athena version.
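To make the target concrete, here is a minimal sketch of the kind of kernel being benchmarked: a naive Kokkos implementation of a second-order centered first derivative in X on a 3D mesh using an MDRangePolicy. The view names, mesh sizes, and grid spacing are illustrative assumptions, not code from this repository.

```c++
// Minimal sketch: naive Kokkos centered finite-difference derivative in X.
// View names, mesh sizes, and dx are illustrative assumptions.
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int nx = 128, ny = 128, nz = 128;  // assumed mesh dimensions
    const double dx = 1.0 / nx;              // assumed grid spacing

    Kokkos::View<double***> u("u", nz, ny, nx);        // input field
    Kokkos::View<double***> dudx("dudx", nz, ny, nx);  // output derivative

    // Interior points only; boundary handling is a separate question below.
    Kokkos::parallel_for(
        "centered_dx",
        Kokkos::MDRangePolicy<Kokkos::Rank<3>>({0, 0, 1}, {nz, ny, nx - 1}),
        KOKKOS_LAMBDA(const int k, const int j, const int i) {
          dudx(k, j, i) = (u(k, j, i + 1) - u(k, j, i - 1)) / (2.0 * dx);
        });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```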


Basic types of functions I want to test:

Stenciless operations (see the pointwise sketch after this list)

Centered 1D stencil operations, in X, Y, and Z.

Small and Large Stencils

High and Low arithmetic intensities

High and Low register use (simple and complex functions)
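
For contrast with the stencil kernel sketched above, a stenciless (pointwise) operation in the same style is shown below; comparing the two isolates the cost of the neighbor accesses, and swapping in a heavier body (e.g., repeated transcendental calls) varies the arithmetic intensity. The snippet assumes the views and dimensions from the earlier sketch.

```c++
// Sketch: stenciless (pointwise) kernel, assuming the views u, dudx and the
// dimensions nz, ny, nx from the sketch above. No neighbor accesses, so
// performance is dominated by streaming bandwidth.
Kokkos::parallel_for(
    "pointwise_scale",
    Kokkos::MDRangePolicy<Kokkos::Rank<3>>({0, 0, 0}, {nz, ny, nx}),
    KOKKOS_LAMBDA(const int k, const int j, const int i) {
      dudx(k, j, i) = 2.0 * u(k, j, i);
    });
```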


The main questions I want answered are:

What is the fastest CPU implementation?

What is the fastest Kokkos implementation for CPUs?

What is the fastest Kokkos implementation for GPUs?

What is the fastest Kokkos implementation for both CPUs and GPUs?

What is the fastest CUDA implementation for GPUs?

What is the easiest Kokkos implementation that is still fast for both CPUs and GPUs?

How do all of these implementations compare?


Specific approaches I want to try, questions, and random notes

Varying problem size? Varying register use?

How to get SIMD computations on the CPU?

Can Kokkos give the same performance as CUDA?

Does UVM incur a performance penalty? Even if the data stays on the GPU? In CUDA? In Kokkos? How big a performance penalty is there if data is transferred on and off the GPU every time step? (see the UVM sketch after this list)

How do you best utilize the explicit caches? On GPUs? Can you do this with Kokkos? On CPUs with SIMD? (see the scratch-memory sketch after this list)

Do aligned data accesses matter? On the GPU? On the CPU?

What if boundaries are computed in a separate function? On the GPU? On the CPU? (with and without UVM)

What if boundaries are transferred via MPI? Between CUDA device addresses? With UVM? On the same node vs. separate nodes? Does cudaMemAdvise help, and does it apply in Kokkos too?

How does high register use change these answers? (e.g., in reconstruction?)

Can OpenACC give the same performance as Kokkos? As CUDA?

Do these same considerations apply to FPGAs?
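
On the explicit-cache question above: Kokkos exposes GPU shared memory (with a host-side fallback) as team scratch memory through hierarchical parallelism. Below is a hedged sketch, assuming the same field layout as the earlier sketches, that stages one (k, j) pencil of u into scratch before applying the centered stencil; whether this actually beats the naive MDRangePolicy version on a given architecture is one of the questions this project aims to answer.

```c++
// Sketch: staging a 1D pencil into team scratch memory ("explicit cache")
// before applying the centered stencil. Names and layout are assumptions.
#include <Kokkos_Core.hpp>

using ExecSpace  = Kokkos::DefaultExecutionSpace;
using ScratchPad = Kokkos::View<double*, ExecSpace::scratch_memory_space,
                                Kokkos::MemoryUnmanaged>;

void centered_dx_scratch(Kokkos::View<double***> u,
                         Kokkos::View<double***> dudx, double dx) {
  const int nz = u.extent(0), ny = u.extent(1), nx = u.extent(2);
  const size_t scratch_bytes = ScratchPad::shmem_size(nx);

  // One team per (k, j) pencil; threads in a team share the scratch line.
  Kokkos::TeamPolicy<ExecSpace> policy(nz * ny, Kokkos::AUTO);

  Kokkos::parallel_for(
      "centered_dx_scratch",
      policy.set_scratch_size(0, Kokkos::PerTeam(scratch_bytes)),
      KOKKOS_LAMBDA(const Kokkos::TeamPolicy<ExecSpace>::member_type& team) {
        const int k = team.league_rank() / ny;
        const int j = team.league_rank() % ny;

        // Cooperatively load the pencil into scratch (shared memory on GPUs).
        ScratchPad line(team.team_scratch(0), nx);
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nx),
                             [&](const int i) { line(i) = u(k, j, i); });
        team.team_barrier();

        // Apply the centered stencil out of scratch, interior points only.
        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(team, 1, nx - 1), [&](const int i) {
              dudx(k, j, i) = (line(i + 1) - line(i - 1)) / (2.0 * dx);
            });
      });
}
```

On the UVM question, the memory space can be swapped on the View itself. A minimal sketch, assuming the same dimensions as above and a Kokkos build with the CUDA backend:

```c++
// Sketch: the same field allocated in managed (UVM) vs. explicit device
// memory. Requires Kokkos built with the CUDA backend; nz, ny, nx as above.
Kokkos::View<double***, Kokkos::CudaUVMSpace> u_uvm("u_uvm", nz, ny, nx);
Kokkos::View<double***, Kokkos::CudaSpace>    u_dev("u_dev", nz, ny, nx);

// Explicit transfer path for the non-UVM case, to model moving data
// on and off the GPU every time step:
auto u_host = Kokkos::create_mirror_view(u_dev);
Kokkos::deep_copy(u_dev, u_host);
```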
