diff --git a/docs/elements/balar/BalarInDepth.md b/docs/elements/balar/BalarInDepth.md
new file mode 100644
index 00000000..3653fb20
--- /dev/null
+++ b/docs/elements/balar/BalarInDepth.md
@@ -0,0 +1,127 @@
+---
+title: Balar In Depth
+---
+
+This doc provides a high-level view of various aspects of *balar*.
+
+## balar CUDA call dispatch mechanism
+
+In *balar*, every CUDA API call and its return value are represented by `SST::BalarComponent::BalarCudaCallPacket_t` and `SST::BalarComponent::BalarCudaCallReturnPacket_t`, respectively. These two structures carry the arguments needed for CUDA function calls and their return values.
+
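+The sketch below is a simplified, illustrative version of what such packets carry; the type and field names here are assumptions for illustration only, not balar's actual definitions (see balar's source for those):
+
+```cpp
+#include <cstdint>
+#include <cstddef>
+
+// Illustrative sketch only -- not balar's actual packet definitions.
+enum CudaCallId : uint32_t { CUDA_MALLOC, CUDA_MEMCPY, CUDA_LAUNCH, CUDA_FREE };
+
+struct CudaCallPacketSketch {
+    CudaCallId call_id;                  // which CUDA API is requested
+    union {                              // per-call arguments
+        struct { uint64_t dev_ptr_addr; size_t size; } cuda_malloc;
+        struct { uint64_t dst; uint64_t src; size_t count; uint32_t kind; } cuda_memcpy;
+    };
+};
+
+struct CudaReturnPacketSketch {
+    CudaCallId call_id;     // echoes the call this return belongs to
+    uint32_t   cuda_error;  // cudaError_t value of the completed call
+    uint64_t   malloc_addr; // e.g. device pointer produced by cudaMalloc
+};
+```
+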
+Since *balar* is an [MMIO](https://sst-simulator.org/sst-docs/docs/elements/memHierarchy/stdmem#mmio) (memory-mapped I/O) device, it receives CUDA call packets via incoming writes to its mapped address. Specifically, dispatch follows the sequence below:
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant dmaEngine
+ participant balarMMIO
+ participant balarTestCPU
+ participant memory
+    balarTestCPU->>memory: Write CUDA API packet to scratch memory location
+    balarTestCPU->>balarMMIO: Write pointer to scratch memory location
+    balarMMIO->>dmaEngine: Issue a Read to retrieve the CUDA packet
+ dmaEngine->>balarMMIO: Return packet
+ balarMMIO->>balarMMIO: Call GPGPU-Sim functions
+ balarMMIO->>memory: Write CUDA return packet to pointer
+ balarMMIO->>balarTestCPU: Send response to the initial write
+```
+
+:::note
+
+*BalarTestCPU* writes a pointer to the CUDA packet into *balar*'s MMIO address range; *balar* then uses this pointer to copy the actual packet content into simulator memory space.
+
+:::
+
+:::note
+
+With direct-execution, `cudaMemcpy()` calls are handled somewhat differently. Specifically, *balar* uses the *dmaEngine* to copy data out of the SST memory system for `cudaMemcpyHostToDevice` and to copy data from simulator memory space back into SST memory for `cudaMemcpyDeviceToHost`.
+
+:::
+
+## Custom CUDA runtime library
+
+Located in `src/sst/elements/balar/tests/vanadisLLVMRISCV`, the custom runtime library `cuda_runtime_api_vanadis.cc` is linked with CUDA programs. For most CUDA APIs, it creates `SST::BalarComponent::BalarCudaCallPacket_t` packets and sends pointers to those packets to *balar*.
+
+Each CUDA call goes through `makeCudaCall()`, which first maps *balar*'s MMIO range into *vanadis*'s virtual memory, surrounded by memory fencing ops. The actual `mmap` call is performed via inline assembly to avoid invalid accesses to *balar*'s MMIO address due to out-of-order execution, and the range is unmapped immediately after the packet pointer is written for the same reason. A simplified sketch of this flow is shown after the list below.
+
+- For blocking CUDA calls, the runtime polls the last CUDA API return status via `readLastCudaStatus()` until the operation completes.
+- For non-blocking CUDA calls, the runtime returns immediately.
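+
+The sketch below summarizes this flow. It is not the actual implementation in `cuda_runtime_api_vanadis.cc`; the helper names (other than `readLastCudaStatus()`, mentioned above) and the completion check are assumptions for illustration:
+
+```cpp
+#include <cstdint>
+
+// Forward declaration of balar's call packet (defined in balar's source).
+struct BalarCudaCallPacket_t;
+
+// Illustrative stand-ins for the inline-assembly mmap/unmap and the status
+// polling done by the real runtime.
+volatile uint64_t *map_balar_mmio();
+void unmap_balar_mmio(volatile uint64_t *addr);
+uint32_t readLastCudaStatus();
+bool callCompleted(uint32_t status);
+
+// Simplified sketch of how the runtime hands one call packet to balar.
+void dispatchCudaCallSketch(BalarCudaCallPacket_t *packet, bool blocking) {
+    // 1. Map balar's MMIO range into vanadis's virtual address space
+    //    (inline assembly plus memory fences in the real runtime).
+    volatile uint64_t *mmio = map_balar_mmio();
+
+    // 2. Write the pointer to the packet; balar then uses the dmaEngine
+    //    to read the packet contents out of memory.
+    *mmio = reinterpret_cast<uint64_t>(packet);
+
+    // 3. Unmap immediately so out-of-order accesses cannot touch the range.
+    unmap_balar_mmio(mmio);
+
+    // 4. Blocking calls poll the last CUDA return status until the
+    //    operation completes; non-blocking calls just return.
+    if (blocking) {
+        while (!callCompleted(readLastCudaStatus())) { /* spin */ }
+    }
+}
+```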
+
+## Trace-driven mode component setup
+
+We provide a config script `src/sst/elements/balar/tests/testBalar-testcpu.py` to run with trace information. The configuration graph roughly looks like this:
+
+```mermaid
+flowchart TD
+ balarTestCPU
+ balarMMIO
+ dmaEngine
+ memory
+ router
+ balarTestCPU <--> router
+ balarMMIO <--mmio_iface--> router
+ dmaEngine <--mem_iface--> router
+ dmaEngine <--mmio_iface--> router
+ memory <--> router
+```
+
+*dmaEngine* has two memory interfaces: one for receiving commands (`mmio_iface`) and one for accessing data (`mem_iface`).
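+
+As a rough illustration, the fragment below shows how a component with two memory interfaces can be declared in an SST Python config. Component types, slot names, and parameters here are assumptions; `testBalar-testcpu.py` is the authoritative configuration:
+
+```python
+import sst
+
+# Illustrative fragment only -- see testBalar-testcpu.py for real values.
+balar = sst.Component("balarMMIO", "balar.balarMMIO")
+balar.addParams({"verbose": 0, "base_addr": 0x310000000})
+
+dma = sst.Component("dmaEngine", "balar.dmaEngine")
+# Two StandardMem interfaces: one receives DMA commands written to the
+# engine's MMIO range, the other moves the actual packet/copy data.
+dma_mmio_iface = dma.setSubComponent("mmio_iface", "memHierarchy.standardInterface")
+dma_mem_iface = dma.setSubComponent("mem_iface", "memHierarchy.standardInterface")
+```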
+
+## Direct-execution mode component setup
+
+For direct-execution with *vanadis*, the config script is at `src/sst/elements/balar/tests/testBalar-vanadis.py`, with the following configuration graph:
+
+```mermaid
+flowchart TD
+ vanadisCore
+ coreTLB
+ coreCache
+ vanadisOS
+ osMMU
+ balar
+ balarTLB
+ dmaEngine
+ memory
+ router
+ coreCacheBus{{coreCacheBus}}
+
+ subgraph " "
+ direction LR
+ subgraph VanadisCPU
+ direction TB
+ vanadisCore <--> coreTLB
+ coreTLB <--> coreCacheBus
+ coreCacheBus <--> coreCache
+ end
+
+ subgraph OS
+ direction TB
+ vanadisOS <--> osMMU
+ end
+ balarTLB <--> coreCacheBus
+ balarTLB <--MMU::m_nicTlbLink--> osMMU
+ vanadisCore <--> vanadisOS
+ coreTLB <--MMU::m_coreLinks--> osMMU
+ subgraph Balar
+ direction TB
+ dmaEngine <--> balarTLB
+ balar
+ end
+ end
+ balar <--mmio interface--> router
+ coreCache <--> router
+ dmaEngine <--mmio interface--> router
+ router <--> memory
+```
+
+:::note
+
+Some details are omitted for simplicity.
+
+:::
+
+:::note
+
+*balar* needs a TLB as *vanadis* works in virtual memory space. That part of the configuration script is based on the test example for [*rdmaNic*](../rdmaNic/intro.md).
+
+:::
diff --git a/docs/elements/balar/CompilingRISCVCUDA.md b/docs/elements/balar/CompilingRISCVCUDA.md
new file mode 100644
index 00000000..c0e233c0
--- /dev/null
+++ b/docs/elements/balar/CompilingRISCVCUDA.md
@@ -0,0 +1,84 @@
+---
+title: Compiling RISCV + CUDA
+---
+
+This page explains how to compile a CUDA program from source code into a binary that can be run with *vanadis* and *balar*.
+
+:::info
+
+This documentation assumes you have followed the [QuickStart](./QuickStart.md#llvm--riscv-gnu-toolchain) guide on setting up LLVM and RISCV GNU toolchain.
+
+:::
+
+Since no real GPU will be used during the simulation, a custom CUDA runtime library is needed to intercept the CUDA API calls from CPU code and dispatch them to *balar* and GPGPU-Sim. You can find the source code for the custom CUDA runtime at `src/sst/elements/balar/tests/vanadisLLVMRISCV/cuda_runtime_api_vanadis.cc`.
+
+In the same folder, there is a Makefile that handles compiling the custom runtime and a simple vector add program; a short usage example follows the listing.
+
+```Makefile title="Makefile"
+# Custom Vanadis CUDA lib
+CUSTOM_CUDA_LIB_SRC := cuda_runtime_api_vanadis.cc
+CUSTOM_CUDA_LIB := cudart_vanadis
+
+# Clang prefix
+CLANG_PREFIX := $(LLVM_INSTALL_PATH)/bin/
+
+# RISCV Toolchain path
+# Need to set RISCV_TOOLCHAIN_INSTALL_PATH env variable
+RISCV_PREFIX := $(RISCV_TOOLCHAIN_INSTALL_PATH)/bin/riscv64-unknown-linux-gnu-
+RISCV_SYSROOT := $(RISCV_TOOLCHAIN_INSTALL_PATH)/sysroot/
+RISCV_CXX := $(wildcard $(RISCV_TOOLCHAIN_INSTALL_PATH)/riscv64-unknown-linux-gnu/include/c++/*/)
+
+# Clang flags
+C_FLAGS_RISCV_LINUX := --target=riscv64-unknown-linux-gnu -march=rv64gc -static-libgcc --gcc-toolchain=$(RISCV_TOOLCHAIN_INSTALL_PATH) --sysroot=$(RISCV_SYSROOT)
+CXX_FLAGS_RISCV_LINUX := -static-libgcc -static-libstdc++ -static --gcc-toolchain=$(RISCV_TOOLCHAIN_INSTALL_PATH) --sysroot=$(RISCV_SYSROOT) -nostdlibinc -stdlib++-isystem$(RISCV_CXX)/ -stdlib++-isystem$(RISCV_SYSROOT)/usr/include -stdlib++-isystem$(RISCV_CXX)/riscv64-unknown-linux-gnu --target=riscv64-unknown-linux-gnu -march=rv64gc
+CXX_FLAGS_RISCV_LINUX_CLANG_ONLY := --gcc-toolchain=$(RISCV_TOOLCHAIN_INSTALL_PATH) --sysroot=$(RISCV_SYSROOT) -I$(RISCV_SYSROOT)/usr/include --target=riscv64 -march=rv64gc
+CXX_FLAGS_CUDA := --cuda-gpu-arch=$(GPU_ARCH) --cuda-path=$(CUDA_INSTALL_PATH) -L$(CUDA_INSTALL_PATH)/lib64 -lcudart -ldl -lrt -pthread
+CXX_FLAGS_RISCV_CUDA := -static --cuda-gpu-arch=$(GPU_ARCH) --cuda-path=$(CUDA_INSTALL_PATH) -L. -Wl,-static -l$(CUSTOM_CUDA_LIB) -L$(CUDA_INSTALL_PATH)/lib64 -Wl,-static -ldl -lrt -pthread
+CXX_FLAGS_CUDA_WRAPPER := -include __clang_cuda_runtime_wrapper.h
+
+all: helloworld vecadd
+
+test_llvm_env:
+ifeq ($(LLVM_INSTALL_PATH),)
+ $(error env LLVM_INSTALL_PATH is not set)
+endif
+
+test_vanadis_envs:
+ifeq ($(RISCV_TOOLCHAIN_INSTALL_PATH),)
+ $(error env RISCV_TOOLCHAIN_INSTALL_PATH is not set)
+endif
+ifeq ($(GPU_ARCH),)
+ $(error env GPU_ARCH is not set)
+endif
+
+# Sample helloworld program, pure CPU code
+# test if clang is working properly
+helloworld: helloworld.c test_llvm_env
+ $(CLANG_PREFIX)clang -static -mno-relax $(C_FLAGS_RISCV_LINUX) $< -o $@
+
+# Custom CUDA runtime target
+# Have to embed the CUDA version information for GPGPU-Sim
+vanadis_cuda: $(CUSTOM_CUDA_LIB_SRC) test_vanadis_envs
+ $(RISCV_PREFIX)gcc -c -static -I$(CUDA_INSTALL_PATH)/include -I. -DBALAR_CUDA_VERSION=\"libcudart_vanadis.a.$(shell echo $$CUDA_VERSION_NUMBER | cut -c 1-2)\" -fpic $<
+ $(RISCV_PREFIX)gcc -shared -o lib$(CUSTOM_CUDA_LIB).so $(CUSTOM_CUDA_LIB_SRC:.cc=.o)
+ $(RISCV_PREFIX)ar rcs lib$(CUSTOM_CUDA_LIB).a $(CUSTOM_CUDA_LIB_SRC:.cc=.o)
+
+# RISCV + CUDA binary
+vecadd: vecadd.cu test_vanadis_envs test_llvm_env vanadis_cuda
+ $(CLANG_PREFIX)clang++ $< -o $@ $(CXX_FLAGS_RISCV_CUDA) $(CXX_FLAGS_CUDA_WRAPPER) $(CXX_FLAGS_RISCV_LINUX)
+
+# x86 + CUDA binary
+vecadd_normal: vecadd.cu
+ $(CLANG_PREFIX)clang++ $< -o $@ $(CXX_FLAGS_CUDA_WRAPPER) $(CXX_FLAGS_CUDA)
+
+clean:
+ rm -f lib$(CUSTOM_CUDA_LIB).so lib$(CUSTOM_CUDA_LIB).a $(CUSTOM_CUDA_LIB_SRC:.cc=.o) vecadd helloworld
+```
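+
+With the environment variables from the [QuickStart](./QuickStart.md) page set, the targets above can be used as follows:
+
+```bash
+# Build the RISCV + CUDA vector add binary; this also builds the custom
+# runtime first since vanadis_cuda is a prerequisite of vecadd
+make vecadd
+
+# Optional: an x86 + CUDA build of the same program for comparison
+make vecadd_normal
+```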
+
+For more complicated CUDA programs, you can refer to the flags and compiler setup in [gpu-app-collection's Makefile](https://github.com/accel-sim/gpu-app-collection/blob/7db54738af0aed3760f496f6c968ee5a40c0ee46/src/cuda/common/common.mk#L77-L106).
+
+:::note
+
+Vanadis currently only supports statically linked programs.
+
+:::
diff --git a/docs/elements/balar/QuickStart.md b/docs/elements/balar/QuickStart.md
new file mode 100644
index 00000000..640e17c6
--- /dev/null
+++ b/docs/elements/balar/QuickStart.md
@@ -0,0 +1,181 @@
+---
+title: QuickStart
+---
+
+This page provides instructions to set up *balar* and run test examples.
+
+## Prerequisites
+
+### CUDA
+Please refer to [NVIDIA's website](https://developer.nvidia.com/cuda-downloads) for setting up CUDA.
+
+After CUDA is installed, you will need to set the `CUDA_INSTALL_PATH` environment variable:
+
+```bash
+# Assuming CUDA is installed at /usr/local/cuda
+export CUDA_INSTALL_PATH=/usr/local/cuda
+```
+
+### GPGPU-Sim
+First, install the prerequisites for GPGPU-Sim:
+
+```bash
+sudo apt-get install cmake build-essential xutils-dev bison zlib1g-dev flex libglu1-mesa-dev
+```
+
+Then to build GPGPU-Sim:
+```bash
+git clone https://github.com/accel-sim/gpgpu-sim_distribution.git
+cd gpgpu-sim_distribution
+source setup_environment
+make -j4
+```
+
+### LLVM + RISCV GNU Toolchain
+
+If you wish to run a CUDA binary with [vanadis](../vanadis/intro.md) and balar, you will need to clone and build LLVM and the RISCV GNU toolchain to compile CUDA source code.
+
+```bash
+# Create installation dirs
+mkdir llvm-install
+mkdir riscv-gnu-install
+
+# Set up environment vars to LLVM and RISCV GCC installation folders
+export LLVM_INSTALL_PATH=$(pwd)/llvm-install
+export RISCV_TOOLCHAIN_INSTALL_PATH=$(pwd)/riscv-gnu-install
+
+# Build LLVM with RISC-V, x86, and CUDA support from source
+# x86 is included for testing purpose, you can remove it if
+# you will only run the CUDA binary with SST
+git clone https://github.com/llvm/llvm-project.git
+
+cd llvm-project
+mkdir build && cd build
+cmake -DLLVM_TARGETS_TO_BUILD="RISCV;X86;NVPTX" -DLLVM_DEFAULT_TARGET_TRIPLE=riscv64-unknown-linux-gnu \
+ -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="clang;lld" -DCMAKE_INSTALL_PREFIX=$LLVM_INSTALL_PATH ../llvm
+cmake --build . -j8
+cmake --build . --target install
+cd ..
+
+# Build RISC-V GCC toolchain
+git clone https://github.com/riscv-collab/riscv-gnu-toolchain.git
+
+cd riscv-gnu-toolchain
+./configure --prefix=$RISCV_TOOLCHAIN_INSTALL_PATH
+make linux -j8
+cd ..
+
+# Match with the GPU config file we have (V100)
+export GPU_ARCH=sm_70
+```
+
+### GPU App Collection
+
+In order to run balar's unit tests, you will need to clone the [GPU app collection repo](https://github.com/accel-sim/gpu-app-collection/tree/sst_support). The unit test script will handle compiling these kernels with the custom CUDA runtime library.
+
+```bash
+git clone git@github.com:accel-sim/gpu-app-collection.git
+cd gpu-app-collection
+git checkout sst_support
+
+# Setup environ vars for apps, need to have
+# env var LLVM_INSTALL_PATH and RISCV_TOOLCHAIN_INSTALL_PATH
+# If you plan to compile the apps directly, you will
+# also need to set SST_CUSTOM_CUDA_LIB_PATH to
+# the directory of the custom CUDA library,
+# which normally will be `SST_ELEMENTS_SRC/src/sst/elements/balar/tests/vanadisLLVMRISCV`
+source ./src/setup_environment sst
+```
+
+:::note Manual Compilation
+If you want to compile the Rodinia benchmarks manually or want to test out other kernels in the gpu-app-collection repo, you will need to set the `SST_CUSTOM_CUDA_LIB_PATH` env var and compile the custom CUDA runtime first.
+
+```bash
+# Set SST_CUSTOM_CUDA_LIB_PATH
+export SST_CUSTOM_CUDA_LIB_PATH=SST_ELEMENTS_SRC/src/sst/elements/balar/tests/vanadisLLVMRISCV
+
+# Build custom CUDA runtime
+cd SST_ELEMENTS_SRC/src/sst/elements/balar/tests/vanadisLLVMRISCV
+make
+
+# Compile Rodinia 2.0 and pull data
+cd PATH_TO/gpu-app-collection
+make rodinia_2.0-ft -i -j4 -C ./src
+make data -C ./src
+
+# The compiled binaries will be located at
+# PATH_TO/gpu-app-collection/bin/CUDA_VERSION_NUM/release
+```
+
+:::
+
+## Compilation
+
+There are some subtle details that need to be taken care of when building *sst-core* and *sst-elements*:
+
+```bash
+# For sst-core, you will need to disable MPI and mempools
+cd PATH_TO/SST_CORE_SOURCE/
+./configure --prefix=$SST_CORE_HOME --disable-mpi --disable-mem-pools
+make -j4
+make install
+
+# For sst-elements, you will need to specify CUDA and GPGPU-Sim path
+# GPGPUSIM_ROOT will be set by sourcing the setup_environment script
+cd PATH_TO/SST_ELEMENTS_SOURCE/
+./configure --prefix=$SST_ELEMENTS_HOME --with-sst-core=$SST_CORE_HOME --with-cuda=$CUDA_INSTALL_PATH --with-gpgpusim=$GPGPUSIM_ROOT
+make -j4
+make install
+```
+
+After configuring *sst-elements*, you should see command line output stating that balar will be built. If not, check that CUDA and GPGPU-Sim are installed and compiled properly.
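+
+After `make install`, one way to double-check is to query `sst-info` for the element; if balar was built and registered, its components will be listed (assuming the standard install layout from above):
+
+```bash
+# Lists balar's components if the element was built and installed
+$SST_CORE_HOME/bin/sst-info balar
+```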
+
+## Testing
+
+Balar divides its test cases into three test suites based on how long they take to run:
+
+- *simple*: takes about 10 mins to complete
+- *medium*: takes about 1 hr to complete
+- *long*: takes 1~2 hrs to complete
+
+All of them can be run in parallel with the `-c NUM_CORES` flag.
+
+```bash
+# Run simple tests sequentially
+$SST_CORE_HOME/bin/sst-test-elements -w "*balar*simple*"
+
+# Run medium testcases with 2 processes
+$SST_CORE_HOME/bin/sst-test-elements -c 2 -w "*balar*medium*"
+
+# Run long tests with 4 processes
+$SST_CORE_HOME/bin/sst-test-elements -c 4 -w "*balar*long*"
+
+# Run all tests with 8 processes
+$SST_CORE_HOME/bin/sst-test-elements -c 8 -w "*balar*"
+```
+
+Each test suite will first compile the custom CUDA library at `SST_ELEMENT_SOURCE/src/sst/elements/balar/tests/vanadisLLVMRISCV/` and link it with the Rodinia 2.0 kernels in [gpu-app-collection](https://github.com/accel-sim/gpu-app-collection/tree/sst_support).
+
+## Running examples
+
+```bash
+# cd into balar's tests folder
+cd SST_ELEMENT_SOURCE/src/sst/elements/balar/tests
+
+# With testcpu
+make -C vectorAdd
+sst testBalar-testcpu.py --model-options="-c gpu-v100-mem.cfg -x ./vectorAdd/vectorAdd -t cuda_calls.trace"
+
+# With vanadis
+# Run helloworld example, pure CPU code, no CUDA calls
+make -C vanadisLLVMRISCV
+vanadis_EXE=./vanadisLLVMRISCV/helloworld \
+vanadis_ISA=RISCV64 \
+sst testBalar-vanadis.py --model-options='-c gpu-v100-mem.cfg'
+
+# Run a simple integer vector add example
+vanadis_EXE=./vanadisLLVMRISCV/vecadd \
+vanadis_ISA=RISCV64 \
+BALAR_CUDA_EXE_PATH=./vanadisLLVMRISCV/vecadd \
+sst testBalar-vanadis.py --model-options='-c gpu-v100-mem.cfg'
+```
diff --git a/docs/elements/balar/TracingCUDAProgram.md b/docs/elements/balar/TracingCUDAProgram.md
new file mode 100644
index 00000000..0726e5fb
--- /dev/null
+++ b/docs/elements/balar/TracingCUDAProgram.md
@@ -0,0 +1,53 @@
+---
+title: Tracing CUDA Program
+---
+
+This page provides steps to use the tracer tool to generate CUDA API traces for running *balar* with *BalarTestCPU*.
+
+:::warning
+
+We are working on providing a more robust version of the tracer with the new NVBit release, including a better trace format and better computation validation.
+
+:::
+
+In order to run *balar* in trace-driven mode, you will need to supply the *BalarTestCPU* component with a trace file and the associated GPU memory copy data dumps. We have created a tracer tool based on [NVBit](https://github.com/NVlabs/NVBit) to generate these. You can find the relevant code inside the [Accel-Sim](https://github.com/accel-sim/accel-sim-framework) framework at `ACCEL_SIM_SRC/util/tracer_nvbit/others/cuda_api_tracer_tool`.
+
+:::note
+
+To set up the tool and generate traces, you will need a machine with an NVIDIA GPU installed. [NVBit](https://github.com/NVlabs/NVBit) also has requirements on both GPU hardware and software versions; refer to its README for more info.
+
+:::
+
+To pull and compile the tracer tool:
+
+```bash
+# Get the Accel-Sim framework
+git clone git@github.com:accel-sim/accel-sim-framework.git
+
+# cd into tracer tool folder
+cd accel-sim-framework/util/tracer_nvbit
+
+# Install nvbit
+./install_nvbit.sh
+
+# Compile tracer tool
+# Which will generate a 'cuda_api_tracer_tool.so' file at
+# './others/cuda_api_tracer_tool/cuda_api_tracer'
+make -C ./others/cuda_api_tracer_tool
+```
+
+Then, in order to dump traces, put the path to the tracer tool's shared object in `LD_PRELOAD`:
+
+```bash
+LD_PRELOAD=PATH_TO/cuda_api_tracer_tool.so CUDA_PROG
+```
+
+This will generate the following files when the traced program exits (an example of feeding them back to *balar* follows the list):
+
+- `cuda_calls.trace`: the API trace file tracking
+ - `cudaMemcpy`
+ - `cudaMalloc`
+  - CUDA kernel launches
+  - `cudaFree`
+- `cuMemcpyD2H-X-X.data`: CUDA memcpy device-to-host data payload
+- `cuMemcpyH2D-X-X.data`: CUDA memcpy host-to-device data payload
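+
+These files can then be passed to *balar* in trace-driven mode, as in the [QuickStart](./QuickStart.md) example:
+
+```bash
+# Run balar in trace-driven mode with the generated trace
+# (adjust paths to where the traced program and trace files live)
+sst testBalar-testcpu.py --model-options="-c gpu-v100-mem.cfg -x PATH_TO/CUDA_PROG -t PATH_TO/cuda_calls.trace"
+```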
diff --git a/docs/elements/balar/intro.md b/docs/elements/balar/intro.md
index 3574cf07..1784cbf8 100644
--- a/docs/elements/balar/intro.md
+++ b/docs/elements/balar/intro.md
@@ -2,9 +2,16 @@
title: balar
---
-The *BalarTestCPU* component is a trace-based test CPU that is included inside `balarMMIO` to run simulations with CUDA API call traces and data collected from a real GPU. It works by consuming a trace file and associated CUDA memory copy data files. The cudaMemcpyH2D data payload is collected for program correctness. The cudaMemcpyD2H data is collected to validate computation.
+The *balar* library provides an interface between SST and [GPGPU-Sim](https://github.com/accel-sim/gpgpu-sim_distribution), a cycle-level simulator modeling contemporary graphics processing units (GPUs) running GPU computing workloads written in CUDA. It supports two execution modes: trace-driven and direct-execution.
-*BalarMMIO* is responsible for relaying CUDA API requests from SST to GPGPU-Sim. Currently it supports running with CUDA traces without a real CPU model (with BalarTestCPU) or with a [Vanadis](../vanadis/intro) core (under active development with limited support for MIPS32 and no support for RV64).
+- Trace-driven: balar is driven by a test CPU that consumes CUDA API traces and issues the corresponding CUDA calls.
+- Direct-execution: a CPU model executes a CUDA binary linked with a custom `libcudart` and dispatches CUDA API calls to balar and GPGPU-Sim.
+
+The *BalarTestCPU* component is a trace-based test CPU, included inside the balar folder (`./testcpu/`), that runs simulations with CUDA API call traces and data collected from a real GPU. It works by consuming a trace file and associated CUDA memory copy data files. The `cudaMemcpyH2D` data payload is collected for program correctness. The `cudaMemcpyD2H` data is collected to validate computation.
+
+*BalarMMIO* is responsible for relaying CUDA API requests from SST to GPGPU-Sim. Currently it supports running with CUDA traces without a real CPU model (with BalarTestCPU) or with a [Vanadis](../vanadis/intro) core running a RISCV + CUDA binary against a custom CUDA runtime (`libcudart_vanadis` inside `./tests/vanadisLLVMRISCV/`). The latter mode has been tested with a subset of Rodinia 2.0 benchmark kernels in the unit tests.
+
+The *dmaEngine* component performs data transfers between the SST cache/memory space and simulator memory space. It is required because balar needs to read/write CPU data (e.g. for `cudaMemcpy()` with vanadis) and place it into GPGPU-Sim's memory space for functional simulation. In addition, dmaEngine is also used to read CUDA dispatch packets and write return values for the custom CUDA runtime.
:::note At a Glance
@@ -16,13 +23,32 @@ The *BalarTestCPU* component is a trace-based test CPU that is included inside `
:::
-### Required dependencies
-* **CUDA** Version 10.1 is recommended
-* **GPGPUSim** A link to the GPGPUSim version can be found in balar's [README](https://github.com/sstsimulator/sst-elements/tree/master/src/sst/elements/balar/README.md).
+:::warning
-### Optional dependencies
-*None*
+Support for the trace-driven mode is currently limited, as it was used for early-stage validation of the balar implementation. It has only been tested with a simple integer vector add example.
-## Additional documentation
-A detailed [README](https://github.com/sstsimulator/sst-elements/tree/master/src/sst/elements/balar/README.md) on the balar element can be found in the balar source code.
+We are working on providing a more robust version of this with the new NVBit release, including a better trace format and better computation validation.
+:::
+
+### Required dependencies
+* **CUDA** Version 11.0+ is recommended
+* **GPGPUSim** Use the *dev* branch from [accel-sim/gpgpu-sim_distribution](https://github.com/accel-sim/gpgpu-sim_distribution)
+
+### Optional dependencies
+* [**LLVM**](https://github.com/llvm/llvm-project) For compiling RISCV + CUDA binary
+* [**RISCV GNU Toolchain**](https://github.com/riscv-collab/riscv-gnu-toolchain) For compiling RISCV + CUDA binary
+* [**gpu-app-collection**](https://github.com/accel-sim/gpu-app-collection/tree/sst_support) For running unittest with Rodinia 2.0 kernels
+* **Test Docker image** You can also opt for the prebuilt Docker image with all dependencies set up except for GPGPU-Sim
+ * OS: Ubuntu 22.04
+ * CUDA: 11.7
+ * LLVM: 18.1.8
+ * RISCV: 2024.08.06-nightly
+ * ```bash
+ # Pull prebuilt image
+ docker pull tgrogers/accel-sim_regress:SST-Integration-Ubuntu-22.04-cuda-11.7-llvm-18.1.8-riscv-gnu-2024.08.06-nightly
+ ```
+
+import DocCardList from '@theme/DocCardList';
+
+<DocCardList />
diff --git a/website/docusaurus.config.js b/website/docusaurus.config.js
index baf3cb54..9a2e829b 100644
--- a/website/docusaurus.config.js
+++ b/website/docusaurus.config.js
@@ -131,7 +131,14 @@ module.exports= async function createConfigAsync() {
"apiKey": "21b099cfb417d9b97f19045cc7c20cfe",
"indexName": "sst-simulator",
"placeholder": "Search Docs (not working yet)"
- }
- }
+ },
+ prism: {
+ additionalLanguages: ['bash', 'makefile'],
+ },
+ },
+ markdown: {
+ mermaid: true,
+ },
+ themes: ['@docusaurus/theme-mermaid'],
};
};
diff --git a/website/package.json b/website/package.json
index 155cfc62..21a6cd82 100644
--- a/website/package.json
+++ b/website/package.json
@@ -15,13 +15,14 @@
"@docusaurus/core": "3.3.2",
"@docusaurus/plugin-client-redirects": "3.3.2",
"@docusaurus/preset-classic": "3.3.2",
+ "@docusaurus/theme-mermaid": "^3.3.2",
"@mdx-js/react": "^3.0.0",
"clsx": "^1.2.1",
"katex": "^0.16.9",
"prism-react-renderer": "^2.1.0",
"react": "^18.2.0",
"react-dom": "^18.2.0",
- "rehype-katex":"^7.0.0",
+ "rehype-katex": "^7.0.0",
"remark-math": "^6.0.0"
},
"devDependencies": {
diff --git a/website/sidebars.js b/website/sidebars.js
index 80f6e5ae..0583bb61 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -640,7 +640,18 @@ const sidebars = {
elements: [
'elements/intro',
"elements/ariel/intro",
- "elements/balar/intro",
+ /* Balar */
+ {
+ type: 'category',
+ label: 'balar',
+ link: {type: 'doc', id: 'elements/balar/intro'},
+ items: [
+ "elements/balar/QuickStart",
+ "elements/balar/TracingCUDAProgram",
+ "elements/balar/CompilingRISCVCUDA",
+ "elements/balar/BalarInDepth",
+ ]
+ },
"elements/cacheTracer/intro",
"elements/cassini/intro",
"elements/cramsim/intro",