This is Marlin, a Mixed Auto-Regressive Linear kernel (and the name of one of the planet's fastest fish): an extremely optimized FP16xINT4 matmul kernel aimed at LLM inference that can deliver close-to-ideal (4x) speedups up to batch sizes of 16-32 tokens (in contrast to the 1-2 tokens of prior work with comparable speedups).
Additionally, it includes Sparse-Marlin, an extension of the MARLIN kernels adding support for 2:4 weight sparsity, achieving 5.3x speedups on NVIDIA GPUs (Ampere/Ada).
- NVIDIA GPU with compute capability >= 8.0 (Ampere or Ada, MARLIN is not yet optimized for Hopper)
The following bash prompts indicate where to execute each command:
🖥️ > # Local Machine
🐳 > # Docker Container
If ECC is enabled (e.g., on an A10), the maximum achievable memory bandwidth will be 10-15% lower than in the official spec sheet, as every memory request carries checksum overhead. ECC can be disabled via
🖥️ > sudo nvidia-smi -e 0
which we do in our A10 benchmarks.
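If you want to check whether ECC is currently enabled before changing anything, one way (a sketch; the exact output layout may vary by driver version) is to query the ECC section of nvidia-smi:
🖥️ > nvidia-smi -q -d ECC | grep -A 2 "ECC Mode"
Note that changing the ECC mode typically only takes effect after a GPU reset or reboot.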
Either load the prebuilt container image
🖥️ > docker load -i marlin_container.tar.gz
or clone the repository and build the image from source
🖥️ > git clone --recurse-submodules https://github.com/IST-DASLab/marlin_artifact.git
🖥️ > cd marlin_artifact
🖥️ > docker build -t marlin_container . # about 30 minutes
In the marlin_artifact folder, run
🖥️ > docker run -it --rm --gpus all -v $(pwd)/result:/projects/result --name marlin marlin_container
🐳 > ./runme.sh # about 15 minutes
The results for Figures 1, 9, 11, and 12 can be found in the result folder, specifically in the figures peak_smarlin.pdf, models.pdf, and marlin_roofline.pdf.
To run the tests, inside the container run
🐳 > ./test/runme.sh
Stop the docker container (only if running)
🐳 > exit
In order to reproduce our "sustainable performance" benchmarks, the GPU clocks need to be locked to their respective base values using nvidia-smi. For instance, on the A10:
🖥️ > sudo nvidia-smi --lock-gpu-clocks=885,885 # BASE_GPU_CLOCK
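The base clock differs per GPU model. If you are benchmarking a different GPU, one way to look up its clock information (a sketch; the relevant value is the default/base graphics clock reported for your card) is:
🖥️ > nvidia-smi -q -d CLOCK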
In the marlin_artifact folder, rerun the container
🖥️ > docker run -it --rm --gpus all -v $(pwd)/result:/projects/result --name marlin marlin_container
Inside the container, rerun the benchmark
🐳 > ./runme_sustained.sh # Check results in the result/ folder
[Optional] To reset the GPU to its initial configuration
# stop the container
🐳 > exit
# run on your machine
🖥️ > sudo nvidia-smi --gpu-reset
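If you only want to remove the clock lock rather than reset the whole GPU, nvidia-smi also provides a dedicated reset for the locked GPU clocks (shown here as an alternative, not part of the original workflow):
🖥️ > sudo nvidia-smi --reset-gpu-clocks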
We provide end-to-end benchmarks to evaluate the performance of different large language models (LLMs) using the vLLM framework.
Download LLM checkpoints. You can download the ones you want to test from Hugging Face and place them in the models folder.
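One way to do this (a sketch, assuming the huggingface_hub CLI is installed on your local machine and that you have access to the checkpoint, e.g. after accepting the model license) is:
🖥️ > huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/meta-llama/Llama-2-7b-chat-hf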
We use the following checkpoints for our evaluation.
In the marlin_artifact folder, run the docker container. This command is different from the previous one: the mounted folder is models!
🖥️ > docker run --rm -it --gpus all -v $(pwd)/models:/projects/models --name marlin marlin_container
The following commands should all run inside the docker container.
Adjust the arguments and run the benchmark with e2e/batch_bench.py.
Example Command
/root/miniconda3/envs/vllm/bin/python \
e2e/batch_bench.py \
--model-path="/projects/models/CHECKPOINT_PATH" \
--n-gpus=1 \
--batch-size-list 1 2 4 8 16 32 64 128 \
--n-in-tokens=64 \
--n-out-tokens=64 \
--n-warmup-reps=5 \
--n-reps=10 \
--min-runtime=-1 \
--vllm-gpu-memory-utilization=0.9 \
--vllm-enforce-eager=False
Notes
- Replace CHECKPOINT_PATH with the actual path to the model checkpoint you want to test.
- Adjust --n-gpus to the number of GPUs you want to use.
- Modify --batch-size-list according to the batch sizes you wish to evaluate.
- If you encounter errors, consider tweaking --vllm-gpu-memory-utilization and --vllm-enforce-eager to suit your hardware capabilities.
Argument Descriptions
To customize the benchmarking process using batch_bench.py, you can adjust several command-line arguments as per your requirements.
- --model-path: Specify the path to the model checkpoint folder (in Hugging Face format). Replace CHECKPOINT_PATH with the actual directory name of the model checkpoint you downloaded. For example: --model-path="/projects/models/meta-llama/Llama-2-7b-chat-hf"
- --n-gpus: Set the number of GPUs you want to utilize for testing. For instance, to use one GPU: --n-gpus=1
- --batch-size-list: Provide a list of batch sizes you wish to test. Modify this list based on your testing needs. Example: --batch-size-list 1 2 4 8 16 32 64 128
- --vllm-gpu-memory-utilization: Adjust the ratio of GPU memory reserved for vLLM. If you encounter CUDA out-of-memory errors due to temporary tensors, decrease this value. Increase it to reserve more memory for the key-value cache, allowing for larger batch sizes. Example: --vllm-gpu-memory-utilization=0.9
- --vllm-enforce-eager: Decide whether to force vLLM to use eager mode. Setting it to False enables CUDA Graph for better performance. Setting it to True disables CUDA Graph, which may save GPU memory but could reduce speed. Example: --vllm-enforce-eager=False
Other options that are not necessary to change:
- --n-in-tokens: Number of input tokens per prompt.
- --n-out-tokens: Number of tokens to generate.
- --n-warmup-reps: Number of warm-up iterations before benchmarking.
- --n-reps: Number of iterations after warm-up.
- --min-runtime: Minimum runtime in seconds after warm-up (set to a negative value to disable this option; set to a non-negative value to disable --n-reps).
Output Metrics
This script should give you the total time to generate the 2nd-64th tokens, measured in seconds. You can calculate the speed-ups by running this script on different models and then dividing the resulting times (see the sketch below).
Check the output (stdout). The metric is in the mean_time_exclude_first field of the printed Python dictionary, which looks like the following:
{'model_path': ..., 'n_gpus': ..., 'batch_size': ..., 'mean_time_exclude_first': ..., ...}
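For instance, a minimal sketch of the division in Python (the helper and the numbers below are purely illustrative; only the mean_time_exclude_first field name comes from the script's output):
# Hypothetical helper: speed-up of a Marlin run over a baseline run,
# using the dictionaries printed to stdout by e2e/batch_bench.py.
def speedup(baseline_result: dict, marlin_result: dict) -> float:
    return baseline_result["mean_time_exclude_first"] / marlin_result["mean_time_exclude_first"]

# Example with made-up timings (seconds for the 2nd-64th tokens):
baseline = {"model_path": "fp16-model", "batch_size": 16, "mean_time_exclude_first": 2.0}
marlin = {"model_path": "marlin-model", "batch_size": 16, "mean_time_exclude_first": 0.6}
print(f"speed-up: {speedup(baseline, marlin):.2f}x")  # prints about 3.33x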
This benchmark requires a server process and a client process.
We recommend using a terminal multiplexer like screen and running the server and client processes in different terminals.
To start a screen session, run
SHELL=/bin/bash screen -S vllm
Below are some common screen usages:
- Open a new terminal: CTRL+A, then press C.
- Switch between terminals: CTRL+A, then press N (next) or P (previous).
- Exit a terminal: CTRL+D.
For more details, please refer to screen's documentation.
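One additional tip that is often handy here (standard screen behavior): you can detach from the session with CTRL+A, then D, and reattach to it later with
screen -r vllm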
Example Command
/root/miniconda3/envs/vllm/bin/python \
-m vllm.entrypoints.openai.api_server \
--host=0.0.0.0 \
--port=8001 \
--model=/projects/models/CHECKPOINT_PATH \
--tensor-parallel-size=1 \
--gpu-memory-utilization=.9 \
--disable-log-requests
Notes
- Replace CHECKPOINT_PATH with the actual path to the model checkpoint you want to test.
- Adjust --tensor-parallel-size to the number of GPUs you want to use.
- If you encounter errors, consider tweaking --gpu-memory-utilization to suit your hardware capabilities. You can also optionally add the --enforce-eager flag to disable CUDA Graph. For more options, please refer to the vLLM documentation.
Wait
You can run the client process only after the server has started. Wait until you see the following output:
INFO: Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
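If you want to double-check that the server is ready before launching the client, you can optionally query its OpenAI-compatible model list endpoint from another terminal inside the container (e.g. another screen window); this assumes curl is available and the server is reachable on localhost:8001 as configured above:
curl http://localhost:8001/v1/models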
Example Command
/root/miniconda3/envs/vllm/bin/python \
e2e/qps_bench.py \
--host=localhost \
--port=8001 \
--model=/projects/models/CHECKPOINT_PATH \
--request-rate=1 \
--num-prompts=128 \
--seed=0
Notes
- Replace CHECKPOINT_PATH with the actual path to the model checkpoint you want to test.
- Modify --request-rate and --num-prompts according to the QPS and the testing time you wish to evaluate: number of prompts = request rate (QPS) * testing time (in seconds). We recommend testing for at least 128 seconds (see the example below this list).
- Requests are sent at randomized intervals. You may vary the random seed via --seed.
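As a concrete illustration of the relation above (the parameter values here are only an example): to test at 2 QPS for roughly 128 seconds, you would send 2 * 128 = 256 prompts:
/root/miniconda3/envs/vllm/bin/python \
e2e/qps_bench.py \
--host=localhost \
--port=8001 \
--model=/projects/models/CHECKPOINT_PATH \
--request-rate=2 \
--num-prompts=256 \
--seed=0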
Output Metrics
This script should give you the Time to First Token (TTFT) and Time per Output Token (TPOT) metrics. Check the output (stdout)!