Update balar documentation #19

Open: wants to merge 5 commits into `master`
127 changes: 127 additions & 0 deletions docs/elements/balar/BalarInDepth.md
---
title: Balar In Depth
---

This doc provides a high-level view of various aspects of *balar*.

## Balar CUDA call dispatch mechanism

In *balar*, every CUDA API call and its return value are represented by `SST::BalarComponent::BalarCudaCallPacket_t` and `SST::BalarComponent::BalarCudaCallReturnPacket_t`, respectively. These two structures contain the arguments for CUDA function calls and their return values.
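
For intuition, a call packet can be pictured as an API identifier plus per-call arguments, and a return packet as the corresponding status and result. The sketch below is purely illustrative: the type and field names are placeholders, not the actual definitions in *balar*'s headers.

```cpp
// Purely illustrative sketch -- the real definitions are
// SST::BalarComponent::BalarCudaCallPacket_t and BalarCudaCallReturnPacket_t
// in balar's sources; the exact fields and layout differ.
#include <cstdint>

enum class CudaAPI : uint32_t { Malloc, Memcpy, ConfigureCall, LaunchKernel };

struct MallocArgs { uint64_t size; };
struct MemcpyArgs { uint64_t dst, src, count; uint32_t kind; };
struct LaunchArgs { uint64_t grid_dim, block_dim, shared_mem; };

struct CudaCallPacketSketch {
    CudaAPI api;                 // which CUDA API is being invoked
    union {
        MallocArgs malloc_args;  // arguments for cudaMalloc
        MemcpyArgs memcpy_args;  // arguments for cudaMemcpy
        LaunchArgs launch_args;  // kernel launch configuration
    } args;
};

struct CudaCallReturnPacketSketch {
    CudaAPI  api;                // echoes the request
    uint32_t cuda_error;         // cudaError_t of the completed call
    uint64_t result;             // e.g. device pointer returned by cudaMalloc
    bool     is_done;            // polled by blocking callers
};
```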

Since *balar* is a [MMIO](https://sst-simulator.org/sst-docs/docs/elements/memHierarchy/stdmem#mmio) (memory mapped IO) device, it receives CUDA call packets via incoming writes to its mapped address. Specifically, it follows the dispatch sequence below:

```mermaid
sequenceDiagram
autonumber
participant dmaEngine
participant balarMMIO
participant balarTestCPU
participant memory
balarTestCPU->>memory: Write CUDA API packet to<br/>scratch memory location
balarTestCPU->>balarMMIO: Write pointer to<br/>scratch memory location
balarMMIO->>dmaEngine: Issue a Read to<br/>retrieve the CUDA packet
dmaEngine->>balarMMIO: Return packet
balarMMIO->>balarMMIO: Call GPGPU-Sim functions
balarMMIO->>memory: Write CUDA return packet to pointer
balarMMIO->>balarTestCPU: Send response to the initial write
```

:::note

*BalarTestCPU* writes a pointer to the CUDA packet into *balar*'s MMIO address range, which *balar* then uses to copy the actual packet content into simulator memory space.

:::

:::note

With direct execution, `cudaMemcpy()` calls are handled somewhat differently. Specifically, *balar* uses *dmaEngine* to copy data from the SST memory system into simulator memory space for `cudaMemcpyHostToDevice`, and from simulator memory space back into SST memory for `cudaMemcpyDeviceToHost` (a simplified sketch follows this note).

:::
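
A minimal, synchronous sketch of that direction handling is shown below. The `dmaRead`/`dmaWrite` stubs and the `gpgpusimCopy*` helpers are placeholders for this sketch, not *balar*'s actual (event-driven) implementation.

```cpp
// Simplified, synchronous sketch of the cudaMemcpy handling described in the
// note above. dmaRead/dmaWrite and the gpgpusimCopy* helpers are placeholder
// stubs, not balar's real functions.
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>  // for cudaMemcpyKind

static void dmaRead (uint64_t sstAddr, void* dst, size_t n) { /* issue dmaEngine read  */ }
static void dmaWrite(uint64_t sstAddr, const void* src, size_t n) { /* issue dmaEngine write */ }
static void gpgpusimCopyToDevice  (uint64_t devAddr, const void* src, size_t n) { /* hand off to GPGPU-Sim */ }
static void gpgpusimCopyFromDevice(void* dst, uint64_t devAddr, size_t n) { /* fetch from GPGPU-Sim */ }

void handleMemcpySketch(uint64_t dst, uint64_t src, size_t count, cudaMemcpyKind kind) {
    std::vector<uint8_t> staging(count);                     // buffer in simulator memory space
    if (kind == cudaMemcpyHostToDevice) {
        dmaRead(src, staging.data(), count);                 // SST memory -> simulator space
        gpgpusimCopyToDevice(dst, staging.data(), count);    // simulator space -> GPGPU-Sim
    } else if (kind == cudaMemcpyDeviceToHost) {
        gpgpusimCopyFromDevice(staging.data(), src, count);  // GPGPU-Sim -> simulator space
        dmaWrite(dst, staging.data(), count);                // simulator space -> SST memory
    }
}
```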

## Custom CUDA runtime library

The custom runtime library `cuda_runtime_api_vanadis.cc`, located in `src/sst/elements/balar/tests/vanadisLLVMRISCV`, is linked with CUDA programs. For most CUDA APIs, it creates `SST::BalarComponent::BalarCudaCallPacket_t` packets and sends pointers to those packets to *balar*.

For each CUDA call made via `makeCudaCall()`, *balar*'s MMIO range is first mapped into *vanadis*'s virtual memory, bracketed by memory fencing operations. The actual `mmap` call is performed via inline assembly to avoid invalid accesses to *balar*'s MMIO address due to out-of-order (OoO) execution, and the range is unmapped immediately after the pointer is written for the same reason (a simplified sketch of this flow follows the list below).

- For blocking CUDA calls, the last CUDA API return status is polled via `readLastCudaStatus()` until the operation completes.
- For non-blocking CUDA calls, the call returns immediately.
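
The sketch below illustrates that flow in plain C++. The MMIO base address, the `/dev/mem`-style mapping, and the fence intrinsic are placeholders for this sketch; the real runtime in `cuda_runtime_api_vanadis.cc` issues the `mmap` and fences via inline assembly.

```cpp
// Simplified sketch of the call dispatch flow described above (error checking
// elided). The MMIO base address, mapping method, and fence helper are
// placeholders; they are not balar's or the runtime's real definitions.
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static constexpr off_t  BALAR_MMIO_BASE = 0xFFFF1000;  // placeholder physical address
static constexpr size_t BALAR_MMIO_SIZE = 0x1000;

void makeCudaCallSketch(const void* packet) {
    // Map balar's MMIO range into the process' (vanadis') virtual address space.
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    volatile uint64_t* mmio = static_cast<volatile uint64_t*>(
        mmap(nullptr, BALAR_MMIO_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, BALAR_MMIO_BASE));

    __sync_synchronize();                        // fence so OoO execution cannot touch MMIO early
    *mmio = reinterpret_cast<uint64_t>(packet);  // write the packet pointer; balar DMAs the packet in
    __sync_synchronize();                        // fence before tearing the mapping down

    // Unmap immediately after the pointer is written to avoid stray speculative accesses.
    munmap(const_cast<uint64_t*>(mmio), BALAR_MMIO_SIZE);
    close(fd);

    // A blocking call would then poll the return status (readLastCudaStatus()
    // in the real runtime) until the operation completes.
}
```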

## Trace-driven mode component setup

We provide a config script, `src/sst/elements/balar/tests/testBalar-testcpu.py`, for running with trace information. The configuration graph looks roughly like this:

```mermaid
flowchart TD
balarTestCPU
balarMMIO
dmaEngine
memory
router
balarTestCPU <--> router
balarMMIO <--mmio_iface--> router
dmaEngine <--mem_iface--> router
dmaEngine <--mmio_iface--> router
memory <--> router
```

*dmaEngine* has two memory interfaces: one for receiving commands (`mmio_iface`) and one for accessing data (`mem_iface`).

## Direct-execution mode component setup

For direct execution with *vanadis*, the config script is at `src/sst/elements/balar/tests/testBalar-vanadis.py`, with the following configuration graph:

```mermaid
flowchart TD
vanadisCore
coreTLB
coreCache
vanadisOS
osMMU
balar
balarTLB
dmaEngine
memory
router
coreCacheBus{{coreCacheBus}}

subgraph " "
direction LR
subgraph VanadisCPU
direction TB
vanadisCore <--> coreTLB
coreTLB <--> coreCacheBus
coreCacheBus <--> coreCache
end

subgraph OS
direction TB
vanadisOS <--> osMMU
end
balarTLB <--> coreCacheBus
balarTLB <--MMU::m_nicTlbLink--> osMMU
vanadisCore <--> vanadisOS
coreTLB <--MMU::m_coreLinks--> osMMU
subgraph Balar
direction TB
dmaEngine <--> balarTLB
balar
end
end
balar <--mmio interface--> router
coreCache <--> router
dmaEngine <--mmio interface--> router
router <--> memory
```

:::note

Some details are omitted for simplicity.

:::

:::note

*balar* needs a TLB as *vanadis* works in virtual memory space. That part of the configuration script is based on the test example for [*rdmaNic*](../rdmaNic/intro.md).

:::
84 changes: 84 additions & 0 deletions docs/elements/balar/CompilingRISCVCUDA.md
---
title: Compiling RISCV + CUDA
---

This page describes how to compile a CUDA program from source into a binary that can be run with *vanadis* and *balar*.

:::info

This documentation assumes you have followed the [QuickStart](./QuickStart.md#llvm--riscv-gnu-toolchain) guide for setting up LLVM and the RISCV GNU toolchain.

:::

Since no real GPU is used during the simulation, a custom CUDA runtime library is needed to intercept the CUDA API calls from CPU code and dispatch them to *balar* and GPGPU-Sim. You can find the source code for the custom CUDA runtime at `src/sst/elements/balar/tests/vanadisLLVMRISCV/cuda_runtime_api_vanadis.cc`.

In the same folder, there is a Makefile that handles compiling the custom runtime and a simple vector-add program (an illustrative version of the latter is shown after the Makefile).

```Makefile title="Makefile"
# Custom Vanadis CUDA lib
CUSTOM_CUDA_LIB_SRC := cuda_runtime_api_vanadis.cc
CUSTOM_CUDA_LIB := cudart_vanadis

# Clang prefix
CLANG_PREFIX := $(LLVM_INSTALL_PATH)/bin/

# RISCV Toolchain path
# Need to set RISCV_TOOLCHAIN_INSTALL_PATH env variable
RISCV_PREFIX := $(RISCV_TOOLCHAIN_INSTALL_PATH)/bin/riscv64-unknown-linux-gnu-
RISCV_SYSROOT := $(RISCV_TOOLCHAIN_INSTALL_PATH)/sysroot/
RISCV_CXX := $(wildcard $(RISCV_TOOLCHAIN_INSTALL_PATH)/riscv64-unknown-linux-gnu/include/c++/*/)

# Clang flags
C_FLAGS_RISCV_LINUX := --target=riscv64-unknown-linux-gnu -march=rv64gc -static-libgcc --gcc-toolchain=$(RISCV_TOOLCHAIN_INSTALL_PATH) --sysroot=$(RISCV_SYSROOT)
CXX_FLAGS_RISCV_LINUX := -static-libgcc -static-libstdc++ -static --gcc-toolchain=$(RISCV_TOOLCHAIN_INSTALL_PATH) --sysroot=$(RISCV_SYSROOT) -nostdlibinc -stdlib++-isystem$(RISCV_CXX)/ -stdlib++-isystem$(RISCV_SYSROOT)/usr/include -stdlib++-isystem$(RISCV_CXX)/riscv64-unknown-linux-gnu --target=riscv64-unknown-linux-gnu -march=rv64gc
CXX_FLAGS_RISCV_LINUX_CLANG_ONLY := --gcc-toolchain=$(RISCV_TOOLCHAIN_INSTALL_PATH) --sysroot=$(RISCV_SYSROOT) -I$(RISCV_SYSROOT)/usr/include --target=riscv64 -march=rv64gc
CXX_FLAGS_CUDA := --cuda-gpu-arch=$(GPU_ARCH) --cuda-path=$(CUDA_INSTALL_PATH) -L$(CUDA_INSTALL_PATH)/lib64 -lcudart -ldl -lrt -pthread
CXX_FLAGS_RISCV_CUDA := -static --cuda-gpu-arch=$(GPU_ARCH) --cuda-path=$(CUDA_INSTALL_PATH) -L. -Wl,-static -l$(CUSTOM_CUDA_LIB) -L$(CUDA_INSTALL_PATH)/lib64 -Wl,-static -ldl -lrt -pthread
CXX_FLAGS_CUDA_WRAPPER := -include __clang_cuda_runtime_wrapper.h

all: helloworld vecadd

test_llvm_env:
ifeq ($(LLVM_INSTALL_PATH),)
$(error env LLVM_INSTALL_PATH is not set)
endif

test_vanadis_envs:
ifeq ($(RISCV_TOOLCHAIN_INSTALL_PATH),)
$(error env RISCV_TOOLCHAIN_INSTALL_PATH is not set)
endif
ifeq ($(GPU_ARCH),)
$(error env GPU_ARCH is not set)
endif

# Sample helloworld program, pure CPU code
# test if clang is working properly
helloworld: helloworld.c test_llvm_env
$(CLANG_PREFIX)clang -static -mno-relax $(C_FLAGS_RISCV_LINUX) $< -o $@

# Custom CUDA runtime target
# Have to embed the CUDA version information for GPGPU-Sim
vanadis_cuda: $(CUSTOM_CUDA_LIB_SRC) test_vanadis_envs
$(RISCV_PREFIX)gcc -c -static -I$(CUDA_INSTALL_PATH)/include -I. -DBALAR_CUDA_VERSION=\"libcudart_vanadis.a.$(shell echo $$CUDA_VERSION_NUMBER | cut -c 1-2)\" -fpic $<
$(RISCV_PREFIX)gcc -shared -o lib$(CUSTOM_CUDA_LIB).so $(CUSTOM_CUDA_LIB_SRC:.cc=.o)
$(RISCV_PREFIX)ar rcs lib$(CUSTOM_CUDA_LIB).a $(CUSTOM_CUDA_LIB_SRC:.cc=.o)

# RISCV + CUDA binary
vecadd: vecadd.cu test_vanadis_envs test_llvm_env vanadis_cuda
$(CLANG_PREFIX)clang++ $< -o $@ $(CXX_FLAGS_RISCV_CUDA) $(CXX_FLAGS_CUDA_WRAPPER) $(CXX_FLAGS_RISCV_LINUX)

# x86 + CUDA binary
vecadd_normal: vecadd.cu
$(CLANG_PREFIX)clang++ $< -o $@ $(CXX_FLAGS_CUDA_WRAPPER) $(CXX_FLAGS_CUDA)

clean:
rm -f lib$(CUSTOM_CUDA_LIB).so lib$(CUSTOM_CUDA_LIB).a $(CUSTOM_CUDA_LIB_SRC:.cc=.o) vecadd helloworld
```
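
For reference, the `vecadd` target above builds a simple integer vector-add program. A minimal example along those lines is sketched below; the actual `vecadd.cu` in the tests folder may differ.

```cpp title="vecadd.cu (illustrative)"
// Minimal integer vector-add in the spirit of the vecadd.cu test program;
// the actual source in tests/vanadisLLVMRISCV may differ.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecadd(const int* a, const int* b, int* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    int ha[n], hb[n], hc[n];
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2 * i; }

    int *da, *db, *dc;
    cudaMalloc((void**)&da, n * sizeof(int));
    cudaMalloc((void**)&db, n * sizeof(int));
    cudaMalloc((void**)&dc, n * sizeof(int));
    cudaMemcpy(da, ha, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(int), cudaMemcpyHostToDevice);

    vecadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("hc[10] = %d (expect 30)\n", hc[10]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

Building it boils down to `make vecadd`, which invokes the `clang++` line in the Makefile above with the RISCV and custom-runtime flags.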

For more complicated CUDA programs, you can refer to the flags and CC setup in [gpu-app-collection's Makefile](https://github.com/accel-sim/gpu-app-collection/blob/7db54738af0aed3760f496f6c968ee5a40c0ee46/src/cuda/common/common.mk#L77-L106).

:::note

Vanadis currently only supports statically linked programs.

:::
181 changes: 181 additions & 0 deletions docs/elements/balar/QuickStart.md
---
title: QuickStart
---

This page provides instructions for setting up balar and running the test examples.

## Prerequisites

### CUDA
Please refer to [NVIDIA's website](https://developer.nvidia.com/cuda-downloads) for setting up CUDA.

After CUDA is installed, you need to set the `CUDA_INSTALL_PATH` environment variable:

```bash
# Assuming CUDA is installed at /usr/local/cuda
export CUDA_INSTALL_PATH=/usr/local/cuda
```

### GPGPU-Sim
First, install the prerequisites for GPGPU-Sim:

```bash
sudo apt-get install cmake build-essential xutils-dev bison zlib1g-dev flex libglu1-mesa-dev
```

Then to build GPGPU-Sim:
```bash
git clone https://github.com/accel-sim/gpgpu-sim_distribution.git
cd gpgpu-sim_distribution
source setup_environment
make -j4
```

### LLVM + RISCV GNU Toolchain

If you wish to run a CUDA binary with [vanadis](../vanadis/intro.md) and balar, you will need to clone and build LLVM and the RISCV GNU toolchain to compile CUDA source code.

```bash
# Create installation dirs
mkdir llvm-install
mkdir riscv-gnu-install

# Set up environment vars to LLVM and RISCV GCC installation folders
export LLVM_INSTALL_PATH=$(pwd)/llvm-install
export RISCV_TOOLCHAIN_INSTALL_PATH=$(pwd)/riscv-gnu-install

# Build LLVM with RISC-V, x86, and CUDA support from source
# x86 is included for testing purposes; you can remove it if
# you will only run the CUDA binary with SST
git clone https://github.com/llvm/llvm-project.git

cd llvm-project
mkdir build && cd build
cmake -DLLVM_TARGETS_TO_BUILD="RISCV;X86;NVPTX" -DLLVM_DEFAULT_TARGET_TRIPLE=riscv64-unknown-linux-gnu \
-DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="clang;lld" -DCMAKE_INSTALL_PREFIX=$LLVM_INSTALL_PATH ../llvm
cmake --build . -j8
cmake --build . --target install
cd ..

# Build RISC-V GCC toolchain
git clone https://github.com/riscv-collab/riscv-gnu-toolchain.git

cd riscv-gnu-toolchain
./configure --prefix=$RISCV_TOOLCHAIN_INSTALL_PATH
make linux -j8
cd ..

# Match with the GPU config file we have (V100)
export GPU_ARCH=sm_70
```

### GPU App Collection

To run balar's unit tests, you need to clone the [GPU app collection repo](https://github.com/accel-sim/gpu-app-collection/tree/sst_support). The unit test script handles compiling these kernels with the custom CUDA runtime lib.

```bash
git clone git@github.com:accel-sim/gpu-app-collection.git
cd gpu-app-collection
git checkout sst_support

# Setup environ vars for apps, need to have
# env var LLVM_INSTALL_PATH and RISCV_TOOLCHAIN_INSTALL_PATH
# If you plan to compile the apps directly, you will
# also need to set SST_CUSTOM_CUDA_LIB_PATH to
# the directory of the custom CUDA library,
# which normally will be `SST_ELEMENTS_SRC/src/sst/elements/balar/tests/vanadisLLVMRISCV`
source ./src/setup_environment sst
```

:::note Manual Compilation
If you want to compile the Rodinia benchmarks manually or want to test out other kernels in the gpu-app-collection repo, you will need to set the `SST_CUSTOM_CUDA_LIB_PATH` env var and compile the custom CUDA runtime first.

```bash
# Set SST_CUSTOM_CUDA_LIB_PATH
export SST_CUSTOM_CUDA_LIB_PATH=SST_ELEMENTS_SRC/src/sst/elements/balar/tests/vanadisLLVMRISCV

# Build custom CUDA runtime
cd SST_ELEMENTS_SRC/src/sst/elements/balar/tests/vanadisLLVMRISCV
make

# Compile Rodinia 2.0 and pull data
cd PATH_TO/gpu-app-collection
make rodinia_2.0-ft -i -j4 -C ./src
make data -C ./src

# The compiled binaries will be located at
# PATH_TO/gpu-app-collection/bin/CUDA_VERSION_NUM/release
```

:::

## Compilation

There are some subtle details that need to be taken care of when building *sst-core* and *sst-elements*:

```bash
# For sst-core, you will need to disable MPI and mempools
cd PATH_TO/SST_CORE_SOURCE/
./configure --prefix=$SST_CORE_HOME --disable-mpi --disable-mem-pools
make -j4
make install

# For sst-elements, you will need to specify CUDA and GPGPU-Sim path
# GPGPUSIM_ROOT will be set by sourcing the setup_environment script
cd PATH_TO/SST_ELEMENTS_SOURCE/
./configure --prefix=$SST_ELEMENTS_HOME --with-sst-core=$SST_CORE_HOME --with-cuda=$CUDA_INSTALL_PATH --with-gpgpusim=$GPGPUSIM_ROOT
make -j4
make install
```

After configuring *sst-elements*, you should see command line output stating that balar will be built. If not, check that CUDA and GPGPU-Sim are installed and compiled properly.

## Testing

Balar divides its test cases into three testsuites based on how long they take to run:

- *simple*: takes about 10 mins to complete
- *medium*: takes about 1 hr to complete
- *long*: takes 1~2 hrs to complete

All of them can be run in parallel with the `-c NUM_CORES` flag.

```bash
# Run simple tests sequentially
$SST_CORE_HOME/bin/sst-test-elements -w "*balar*simple*"

# Run medium testcases with 2 processes
$SST_CORE_HOME/bin/sst-test-elements -c 2 -w "*balar*medium*"

# Run long tests with 4 processes
$SST_CORE_HOME/bin/sst-test-elements -c 4 -w "*balar*long*"

# Run all tests with 8 processes
$SST_CORE_HOME/bin/sst-test-elements -c 8 -w "*balar*"
```

Each testsuite first compiles the custom CUDA library at `SST_ELEMENT_SOURCE/src/sst/elements/balar/tests/vanadisLLVMRISCV/` and links it with the Rodinia 2.0 kernels in [gpu-app-collection](https://github.com/accel-sim/gpu-app-collection/tree/sst_support).

## Running examples

```bash
# cd into balar's tests folder
cd SST_ELEMENT_SOURCE/src/sst/elements/balar/tests

# With testcpu
make -C vectorAdd
sst testBalar-testcpu.py --model-options="-c gpu-v100-mem.cfg -x ./vectorAdd/vectorAdd -t cuda_calls.trace"

# With vanadis
# Run helloworld example, pure CPU code, no CUDA calls
make -C vanadisLLVMRISCV
vanadis_EXE=./vanadisLLVMRISCV/helloworld \
vanadis_ISA=RISCV64 \
sst testBalar-vanadis.py --model-options='-c gpu-v100-mem.cfg'

# Run a simple integer vector add example
vanadis_EXE=./vanadisLLVMRISCV/vecadd \
vanadis_ISA=RISCV64 \
BALAR_CUDA_EXE_PATH=./vanadisLLVMRISCV/vecadd \
sst testBalar-vanadis.py --model-options='-c gpu-v100-mem.cfg'
```