Update balar documentation #19

Open: wants to merge 5 commits into `master`
127 changes: 127 additions & 0 deletions docs/elements/balar/BalarInDepth.md
---
title: Balar In Depth
---

This doc provides a high-level view of various aspects of *balar*.

## Balar CUDA call dispatch mechanism

In *balar*, every CUDA API call and its return value are represented by `SST::BalarComponent::BalarCudaCallPacket_t` and `SST::BalarComponent::BalarCudaCallReturnPacket_t`, respectively. These two structures contain the arguments for CUDA function calls and their return values.
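
For intuition, a call packet can be pictured as an API identifier plus per-call arguments, and a return packet as the corresponding status and result. The sketch below is purely illustrative: the type and field names are placeholders, not the actual definitions in *balar*'s headers.

```cpp
// Purely illustrative sketch -- the real definitions are
// SST::BalarComponent::BalarCudaCallPacket_t and BalarCudaCallReturnPacket_t
// in balar's sources; the exact fields and layout differ.
#include <cstdint>

enum class CudaAPI : uint32_t { Malloc, Memcpy, ConfigureCall, LaunchKernel };

struct MallocArgs { uint64_t size; };
struct MemcpyArgs { uint64_t dst, src, count; uint32_t kind; };
struct LaunchArgs { uint64_t grid_dim, block_dim, shared_mem; };

struct CudaCallPacketSketch {
    CudaAPI api;                 // which CUDA API is being invoked
    union {
        MallocArgs malloc_args;  // arguments for cudaMalloc
        MemcpyArgs memcpy_args;  // arguments for cudaMemcpy
        LaunchArgs launch_args;  // kernel launch configuration
    } args;
};

struct CudaCallReturnPacketSketch {
    CudaAPI  api;                // echoes the request
    uint32_t cuda_error;         // cudaError_t of the completed call
    uint64_t result;             // e.g. device pointer returned by cudaMalloc
    bool     is_done;            // polled by blocking callers
};
```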

Since *balar* is a [MMIO](https://sst-simulator.org/sst-docs/docs/elements/memHierarchy/stdmem#mmio) (memory mapped IO) device, it receives CUDA call packets via incoming writes to its mapped address. Specifically, it follows the dispatch sequence below:

```mermaid
sequenceDiagram
autonumber
participant dmaEngine
participant balarMMIO
participant balarTestCPU
participant memory
balarTestCPU->>memory: Write CUDA API packet to<br/>scratch memory location
balarTestCPU->>balarMMIO: Write pointer to<br/>scratch memory location
balarMMIO->>dmaEngine: Issue a Read to<br/>retrieve the CUDA packet
dmaEngine->>balarMMIO: Return packet
balarMMIO->>balarMMIO: Call GPGPU-Sim functions
balarMMIO->>memory: Write CUDA return packet to pointer
balarMMIO->>balarTestCPU: Send response to the initial write
```

:::note

*BalarTestCPU* writes a pointer to the CUDA packet into *balar*'s MMIO address range, which *balar* then uses to copy the actual packet content into simulator memory space.

:::

:::note

With direct execution, `cudaMemcpy()` calls are handled somewhat differently. Specifically, *balar* uses *dmaEngine* to copy data from the SST memory system into simulator memory space for `cudaMemcpyHostToDevice`, and from simulator memory space back into SST memory for `cudaMemcpyDeviceToHost` (a simplified sketch follows this note).

:::
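
A minimal, synchronous sketch of that direction handling is shown below. The `dmaRead`/`dmaWrite` stubs and the `gpgpusimCopy*` helpers are placeholders for this sketch, not *balar*'s actual (event-driven) implementation.

```cpp
// Simplified, synchronous sketch of the cudaMemcpy handling described in the
// note above. dmaRead/dmaWrite and the gpgpusimCopy* helpers are placeholder
// stubs, not balar's real functions.
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>  // for cudaMemcpyKind

static void dmaRead (uint64_t sstAddr, void* dst, size_t n) { /* issue dmaEngine read  */ }
static void dmaWrite(uint64_t sstAddr, const void* src, size_t n) { /* issue dmaEngine write */ }
static void gpgpusimCopyToDevice  (uint64_t devAddr, const void* src, size_t n) { /* hand off to GPGPU-Sim */ }
static void gpgpusimCopyFromDevice(void* dst, uint64_t devAddr, size_t n) { /* fetch from GPGPU-Sim */ }

void handleMemcpySketch(uint64_t dst, uint64_t src, size_t count, cudaMemcpyKind kind) {
    std::vector<uint8_t> staging(count);                     // buffer in simulator memory space
    if (kind == cudaMemcpyHostToDevice) {
        dmaRead(src, staging.data(), count);                 // SST memory -> simulator space
        gpgpusimCopyToDevice(dst, staging.data(), count);    // simulator space -> GPGPU-Sim
    } else if (kind == cudaMemcpyDeviceToHost) {
        gpgpusimCopyFromDevice(staging.data(), src, count);  // GPGPU-Sim -> simulator space
        dmaWrite(dst, staging.data(), count);                // simulator space -> SST memory
    }
}
```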

## Custom CUDA runtime library

The custom runtime library `cuda_runtime_api_vanadis.cc`, located in `src/sst/elements/balar/tests/vanadisLLVMRISCV`, is linked with CUDA programs. For most CUDA APIs, it creates `SST::BalarComponent::BalarCudaCallPacket_t` packets and sends pointers to those packets to *balar*.

For each CUDA call made via `makeCudaCall()`, *balar*'s MMIO range is first mapped into *vanadis*'s virtual memory, bracketed by memory fencing operations. The actual `mmap` call is performed via inline assembly to avoid invalid accesses to *balar*'s MMIO address due to out-of-order (OoO) execution, and the range is unmapped immediately after the pointer is written for the same reason (a simplified sketch of this flow follows the list below).

- For blocking CUDA calls, the last CUDA API return status is polled via `readLastCudaStatus()` until the operation completes.
- For non-blocking CUDA calls, the call returns immediately.
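
The sketch below illustrates that flow in plain C++. The MMIO base address, the `/dev/mem`-style mapping, and the fence intrinsic are placeholders for this sketch; the real runtime in `cuda_runtime_api_vanadis.cc` issues the `mmap` and fences via inline assembly.

```cpp
// Simplified sketch of the call dispatch flow described above (error checking
// elided). The MMIO base address, mapping method, and fence helper are
// placeholders; they are not balar's or the runtime's real definitions.
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static constexpr off_t  BALAR_MMIO_BASE = 0xFFFF1000;  // placeholder physical address
static constexpr size_t BALAR_MMIO_SIZE = 0x1000;

void makeCudaCallSketch(const void* packet) {
    // Map balar's MMIO range into the process' (vanadis') virtual address space.
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    volatile uint64_t* mmio = static_cast<volatile uint64_t*>(
        mmap(nullptr, BALAR_MMIO_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, BALAR_MMIO_BASE));

    __sync_synchronize();                        // fence so OoO execution cannot touch MMIO early
    *mmio = reinterpret_cast<uint64_t>(packet);  // write the packet pointer; balar DMAs the packet in
    __sync_synchronize();                        // fence before tearing the mapping down

    // Unmap immediately after the pointer is written to avoid stray speculative accesses.
    munmap(const_cast<uint64_t*>(mmio), BALAR_MMIO_SIZE);
    close(fd);

    // A blocking call would then poll the return status (readLastCudaStatus()
    // in the real runtime) until the operation completes.
}
```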

## Trace-driven mode component setup

We provide a config script, `src/sst/elements/balar/tests/testBalar-testcpu.py`, for running with trace information. The configuration graph looks roughly like this:

```mermaid
flowchart TD
balarTestCPU
balarMMIO
dmaEngine
memory
router
balarTestCPU <--> router
balarMMIO <--mmio_iface--> router
dmaEngine <--mem_iface--> router
dmaEngine <--mmio_iface--> router
memory <--> router
```

*dmaEngine* has two memory interfaces: one for receiving commands (`mmio_iface`) and one for accessing data (`mem_iface`).

## Direct-execution mode component setup

For direct execution with *vanadis*, the config script is at `src/sst/elements/balar/tests/testBalar-vanadis.py`, with the following configuration graph:

```mermaid
flowchart TD
vanadisCore
coreTLB
coreCache
vanadisOS
osMMU
balar
balarTLB
dmaEngine
memory
router
coreCacheBus{{coreCacheBus}}

subgraph " "
direction LR
subgraph VanadisCPU
direction TB
vanadisCore <--> coreTLB
coreTLB <--> coreCacheBus
coreCacheBus <--> coreCache
end

subgraph OS
direction TB
vanadisOS <--> osMMU
end
balarTLB <--> coreCacheBus
balarTLB <--MMU::m_nicTlbLink--> osMMU
vanadisCore <--> vanadisOS
coreTLB <--MMU::m_coreLinks--> osMMU
subgraph Balar
direction TB
dmaEngine <--> balarTLB
balar
end
end
balar <--mmio interface--> router
coreCache <--> router
dmaEngine <--mmio interface--> router
router <--> memory
```

:::note

Some details are omitted for simplicity.

:::

:::note

*balar* needs a TLB as *vanadis* works in virtual memory space. That part of the configuration script is based on the test example for [*rdmaNic*](../rdmaNic/intro.md).

:::
84 changes: 84 additions & 0 deletions docs/elements/balar/CompilingRISCVCUDA.md
---
title: Compiling RISCV + CUDA
---

This page describes how to compile a CUDA program from source into a binary that can be run with *vanadis* and *balar*.

:::info

This documentation assumes you have followed the [QuickStart](./QuickStart.md#llvm--riscv-gnu-toolchain) guide for setting up LLVM and the RISCV GNU toolchain.

:::

Since no real GPU is used during the simulation, a custom CUDA runtime library is needed to intercept the CUDA API calls from CPU code and dispatch them to *balar* and GPGPU-Sim. You can find the source code for the custom CUDA runtime at `src/sst/elements/balar/tests/vanadisLLVMRISCV/cuda_runtime_api_vanadis.cc`.

In the same folder, there is a Makefile that handles compiling the custom runtime and a simple vector-add program (an illustrative version of the latter is shown after the Makefile).

```Makefile title="Makefile"
# Custom Vanadis CUDA lib
CUSTOM_CUDA_LIB_SRC := cuda_runtime_api_vanadis.cc
CUSTOM_CUDA_LIB := cudart_vanadis

# Clang prefix
CLANG_PREFIX := $(LLVM_INSTALL_PATH)/bin/

# RISCV Toolchain path
# Need to set RISCV_TOOLCHAIN_INSTALL_PATH env variable
RISCV_PREFIX := $(RISCV_TOOLCHAIN_INSTALL_PATH)/bin/riscv64-unknown-linux-gnu-
RISCV_SYSROOT := $(RISCV_TOOLCHAIN_INSTALL_PATH)/sysroot/
RISCV_CXX := $(wildcard $(RISCV_TOOLCHAIN_INSTALL_PATH)/riscv64-unknown-linux-gnu/include/c++/*/)

# Clang flags
C_FLAGS_RISCV_LINUX := --target=riscv64-unknown-linux-gnu -march=rv64gc -static-libgcc --gcc-toolchain=$(RISCV_TOOLCHAIN_INSTALL_PATH) --sysroot=$(RISCV_SYSROOT)
CXX_FLAGS_RISCV_LINUX := -static-libgcc -static-libstdc++ -static --gcc-toolchain=$(RISCV_TOOLCHAIN_INSTALL_PATH) --sysroot=$(RISCV_SYSROOT) -nostdlibinc -stdlib++-isystem$(RISCV_CXX)/ -stdlib++-isystem$(RISCV_SYSROOT)/usr/include -stdlib++-isystem$(RISCV_CXX)/riscv64-unknown-linux-gnu --target=riscv64-unknown-linux-gnu -march=rv64gc
CXX_FLAGS_RISCV_LINUX_CLANG_ONLY := --gcc-toolchain=$(RISCV_TOOLCHAIN_INSTALL_PATH) --sysroot=$(RISCV_SYSROOT) -I$(RISCV_SYSROOT)/usr/include --target=riscv64 -march=rv64gc
CXX_FLAGS_CUDA := --cuda-gpu-arch=$(GPU_ARCH) --cuda-path=$(CUDA_INSTALL_PATH) -L$(CUDA_INSTALL_PATH)/lib64 -lcudart -ldl -lrt -pthread
CXX_FLAGS_RISCV_CUDA := -static --cuda-gpu-arch=$(GPU_ARCH) --cuda-path=$(CUDA_INSTALL_PATH) -L. -Wl,-static -l$(CUSTOM_CUDA_LIB) -L$(CUDA_INSTALL_PATH)/lib64 -Wl,-static -ldl -lrt -pthread
CXX_FLAGS_CUDA_WRAPPER := -include __clang_cuda_runtime_wrapper.h

all: helloworld vecadd

test_llvm_env:
ifeq ($(LLVM_INSTALL_PATH),)
$(error env LLVM_INSTALL_PATH is not set)
endif

test_vanadis_envs:
ifeq ($(RISCV_TOOLCHAIN_INSTALL_PATH),)
$(error env RISCV_TOOLCHAIN_INSTALL_PATH is not set)
endif
ifeq ($(GPU_ARCH),)
$(error env GPU_ARCH is not set)
endif

# Sample helloworld program, pure CPU code
# test if clang is working properly
helloworld: helloworld.c test_llvm_env
$(CLANG_PREFIX)clang -static -mno-relax $(C_FLAGS_RISCV_LINUX) $< -o $@

# Custom CUDA runtime target
# Have to embed the CUDA version information for GPGPU-Sim
vanadis_cuda: $(CUSTOM_CUDA_LIB_SRC) test_vanadis_envs
$(RISCV_PREFIX)gcc -c -static -I$(CUDA_INSTALL_PATH)/include -I. -DBALAR_CUDA_VERSION=\"libcudart_vanadis.a.$(shell echo $$CUDA_VERSION_NUMBER | cut -c 1-2)\" -fpic $<
$(RISCV_PREFIX)gcc -shared -o lib$(CUSTOM_CUDA_LIB).so $(CUSTOM_CUDA_LIB_SRC:.cc=.o)
$(RISCV_PREFIX)ar rcs lib$(CUSTOM_CUDA_LIB).a $(CUSTOM_CUDA_LIB_SRC:.cc=.o)

# RISCV + CUDA binary
vecadd: vecadd.cu test_vanadis_envs test_llvm_env vanadis_cuda
$(CLANG_PREFIX)clang++ $< -o $@ $(CXX_FLAGS_RISCV_CUDA) $(CXX_FLAGS_CUDA_WRAPPER) $(CXX_FLAGS_RISCV_LINUX)

# x86 + CUDA binary
vecadd_normal: vecadd.cu
$(CLANG_PREFIX)clang++ $< -o $@ $(CXX_FLAGS_CUDA_WRAPPER) $(CXX_FLAGS_CUDA)

clean:
rm -f lib$(CUSTOM_CUDA_LIB).so lib$(CUSTOM_CUDA_LIB).a $(CUSTOM_CUDA_LIB_SRC:.cc=.o) vecadd helloworld
```
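
For reference, the `vecadd` target above builds a simple integer vector-add program. A minimal example along those lines is sketched below; the actual `vecadd.cu` in the tests folder may differ.

```cpp title="vecadd.cu (illustrative)"
// Minimal integer vector-add in the spirit of the vecadd.cu test program;
// the actual source in tests/vanadisLLVMRISCV may differ.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecadd(const int* a, const int* b, int* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    int ha[n], hb[n], hc[n];
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2 * i; }

    int *da, *db, *dc;
    cudaMalloc((void**)&da, n * sizeof(int));
    cudaMalloc((void**)&db, n * sizeof(int));
    cudaMalloc((void**)&dc, n * sizeof(int));
    cudaMemcpy(da, ha, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(int), cudaMemcpyHostToDevice);

    vecadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("hc[10] = %d (expect 30)\n", hc[10]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

Building it boils down to `make vecadd`, which invokes the `clang++` line in the Makefile above with the RISCV and custom-runtime flags.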

For more complicated CUDA programs, you can refer to the flags and CC setup in [gpu-app-collection's Makefile](https://github.com/accel-sim/gpu-app-collection/blob/7db54738af0aed3760f496f6c968ee5a40c0ee46/src/cuda/common/common.mk#L77-L106).

:::note

Vanadis currently only supports statically linked programs.

:::
181 changes: 181 additions & 0 deletions docs/elements/balar/QuickStart.md
---
title: QuickStart
---

This page provides instructions for setting up balar and running the test examples.

## Prerequisites

### CUDA
Please refer to [NVIDIA's website](https://developer.nvidia.com/cuda-downloads) for setting up CUDA.

After CUDA is installed, you need to set the `CUDA_INSTALL_PATH` environment variable:

```bash
# Assuming CUDA is installed at /usr/local/cuda
export CUDA_INSTALL_PATH=/usr/local/cuda
```

### GPGPU-Sim
First, install the prerequisites for GPGPU-Sim:

```bash
sudo apt-get install cmake build-essential xutils-dev bison zlib1g-dev flex libglu1-mesa-dev
```

Then to build GPGPU-Sim:
```bash
git clone https://github.com/accel-sim/gpgpu-sim_distribution.git
cd gpgpu-sim_distribution
source setup_environment
make -j4
```

### LLVM + RISCV GNU Toolchain

If you wish to run a CUDA binary with [vanadis](../vanadis/intro.md) and balar, you will need to clone and build LLVM and the RISCV GNU toolchain to compile CUDA source code.

```bash
# Create installation dirs
mkdir llvm-install
mkdir riscv-gnu-install

# Set up environment vars to LLVM and RISCV GCC installation folders
export LLVM_INSTALL_PATH=$(pwd)/llvm-install
export RISCV_TOOLCHAIN_INSTALL_PATH=$(pwd)/riscv-gnu-install

# Build LLVM with RISC-V, x86, and CUDA support from source
# x86 is included for testing purposes; you can remove it if
# you will only run the CUDA binary with SST
git clone https://github.com/llvm/llvm-project.git

cd llvm-project
mkdir build && cd build
cmake -DLLVM_TARGETS_TO_BUILD="RISCV;X86;NVPTX" -DLLVM_DEFAULT_TARGET_TRIPLE=riscv64-unknown-linux-gnu \
-DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="clang;lld" -DCMAKE_INSTALL_PREFIX=$LLVM_INSTALL_PATH ../llvm
cmake --build . -j8
cmake --build . --target install
cd ..

# Build RISC-V GCC toolchain
git clone https://github.com/riscv-collab/riscv-gnu-toolchain.git

cd riscv-gnu-toolchain
./configure --prefix=$RISCV_TOOLCHAIN_INSTALL_PATH
make linux -j8
cd ..

# Match with the GPU config file we have (V100)
export GPU_ARCH=sm_70
```

### GPU App Collection

To run balar's unit tests, you need to clone the [GPU app collection repo](https://github.com/accel-sim/gpu-app-collection/tree/sst_support). The unit test script handles compiling these kernels with the custom CUDA runtime lib.

```bash
git clone git@github.com:accel-sim/gpu-app-collection.git
cd gpu-app-collection
git checkout sst_support

# Setup environ vars for apps, need to have
# env var LLVM_INSTALL_PATH and RISCV_TOOLCHAIN_INSTALL_PATH
# If you plan to compile the apps directly, you will
# also need to set SST_CUSTOM_CUDA_LIB_PATH to
# the directory of the custom CUDA library,
# which normally will be `SST_ELEMENTS_SRC/src/sst/elements/balar/tests/vanadisLLVMRISCV`
source ./src/setup_environment sst
```

:::note Manual Compilation
If you want to compile the Rodinia benchmarks manually or want to test out other kernels in the gpu-app-collection repo, you will need to set the `SST_CUSTOM_CUDA_LIB_PATH` env var and compile the custom CUDA runtime first.

```bash
# Set SST_CUSTOM_CUDA_LIB_PATH
export SST_CUSTOM_CUDA_LIB_PATH=SST_ELEMENTS_SRC/src/sst/elements/balar/tests/vanadisLLVMRISCV

# Build custom CUDA runtime
cd SST_ELEMENTS_SRC/src/sst/elements/balar/tests/vanadisLLVMRISCV
make

# Compile Rodinia 2.0 and pull data
cd PATH_TO/gpu-app-collection
make rodinia_2.0-ft -i -j4 -C ./src
make data -C ./src

# The compiled binaries will be located at
# PATH_TO/gpu-app-collection/bin/CUDA_VERSION_NUM/release
```

:::

## Compilation

There are some subtle details that need to be taken care of when building *sst-core* and *sst-elements*:

```bash
# For sst-core, you will need to disable MPI and mempools
cd PATH_TO/SST_CORE_SOURCE/
./configure --prefix=$SST_CORE_HOME --disable-mpi --disable-mem-pools
make -j4
make install

# For sst-elements, you will need to specify CUDA and GPGPU-Sim path
# GPGPUSIM_ROOT will be set by sourcing the setup_environment script
cd PATH_TO/SST_ELEMENTS_SOURCE/
./configure --prefix=$SST_ELEMENTS_HOME --with-sst-core=$SST_CORE_HOME --with-cuda=$CUDA_INSTALL_PATH --with-gpgpusim=$GPGPUSIM_ROOT
make -j4
make install
```

After configuring *sst-elements*, you should see command line output stating that balar will be built. If not, check that CUDA and GPGPU-Sim are installed and compiled properly.

## Testing

Balar divides its test cases into three testsuites based on how long they take to run:

- *simple*: takes about 10 mins to complete
- *medium*: takes about 1 hr to complete
- *long*: takes 1~2 hrs to complete

All of them can be run in parallel with the `-c NUM_CORES` flag.

```bash
# Run simple tests sequentially
$SST_CORE_HOME/bin/sst-test-elements -w "*balar*simple*"

# Run medium testcases with 2 processes
$SST_CORE_HOME/bin/sst-test-elements -c 2 -w "*balar*medium*"

# Run long tests with 4 processes
$SST_CORE_HOME/bin/sst-test-elements -c 4 -w "*balar*long*"

# Run all tests with 8 processes
$SST_CORE_HOME/bin/sst-test-elements -c 8 -w "*balar*"
```

Each testsuite first compiles the custom CUDA library at `SST_ELEMENT_SOURCE/src/sst/elements/balar/tests/vanadisLLVMRISCV/` and links it with the Rodinia 2.0 kernels in [gpu-app-collection](https://github.com/accel-sim/gpu-app-collection/tree/sst_support).

## Running examples

```bash
# cd into balar's tests folder
cd SST_ELEMENT_SOURCE/src/sst/elements/balar/tests

# With testcpu
make -C vectorAdd
sst testBalar-testcpu.py --model-options="-c gpu-v100-mem.cfg -x ./vectorAdd/vectorAdd -t cuda_calls.trace"

# With vanadis
# Run helloworld example, pure CPU code, no CUDA calls
make -C vanadisLLVMRISCV
vanadis_EXE=./vanadisLLVMRISCV/helloworld \
vanadis_ISA=RISCV64 \
sst testBalar-vanadis.py --model-options='-c gpu-v100-mem.cfg'

# Run a simple integer vector add example
vanadis_EXE=./vanadisLLVMRISCV/vecadd \
vanadis_ISA=RISCV64 \
BALAR_CUDA_EXE_PATH=./vanadisLLVMRISCV/vecadd \
sst testBalar-vanadis.py --model-options='-c gpu-v100-mem.cfg'
```