Allows you to run llama.cpp with ROCm acceleration on most Radeon RX Vega/5000/6000/7000 series cards, even those not on AMD's official ROCm supported GPU list.
This is a Linux container which builds llama.cpp with ROCm support and uses llama-swap to serve models.
I just put this together from other people's work listed below.
Reddit comment noting that the Debian Bookworm Backports kernel contains the ROCm kernel interface and that Debian Trixie contains the userspace:
Instructions on the Debian-AI list for compiling llama.cpp with ROCm:
llama.cpp - efficient CPU and GPU LLM inference server:
llama-swap - OpenAI-compatible server that serves models and swaps/proxies between inference servers:
Linux with the `amdgpu` driver ROCm interface enabled. Distros with this already included by default are: Debian Bookworm Backports, Debian Trixie/Sid, and Ubuntu 24.04. For other distros you might need to use the `amdgpu-install` script from the AMD website.
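A quick sanity check that the kernel side is in place (the exact render node names vary between systems):

```sh
# /dev/kfd is the ROCm compute interface; /dev/dri holds the GPU render nodes
ls -l /dev/kfd /dev/dri/renderD*

# Confirm the amdgpu kernel module is loaded
lsmod | grep amdgpu
```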
Make sure your GPU is on the Debian ROCm supported GPU list in Trixie/Sid. The Bookworm Backports kernel has the same support level as Trixie.
Add your user to the `video` and `render` groups on your system: `sudo usermod -aG video,render "$USER"`. Log out and log in again. Confirm with the `groups` command.
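For example, after logging back in:

```sh
# Both video and render should appear in the output
id -nG "$USER"
```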
Look up your GPU in the LLVM amdgpu targets, or look at your GPU's code name in `rocminfo`, and replace my `gfx1010` in the `Containerfile` with your GPU's architecture name.
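For example, with the ROCm userspace installed on the host (the `gfx1030` in the `sed` line is just an illustration, substitute your own target):

```sh
# List the gfx targets ROCm reports; pick the one belonging to your GPU
rocminfo | grep -Eo 'gfx[0-9a-f]+' | sort -u

# Swap it into the Containerfile (gfx1030 is only an example value)
sed -i 's/gfx1010/gfx1030/' Containerfile
```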
Build the container:
```sh
podman build . -t rocswap
```
Deploy the container:
```sh
podman run -dit -p 8080:8080 --name rocswap \
  -v ./models:/models \
  -v ./config.yaml:/config.yaml \
  --device /dev/dri --device /dev/kfd \
  --group-add keep-groups \
  --user 1000:1000 \
  rocswap
```
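Once the container is up, a quick smoke test using llama-swap's OpenAI-style endpoints (`my-model` is a placeholder that must match a model name in your config.yaml):

```sh
# List the models defined in config.yaml
curl http://localhost:8080/v1/models

# Send a test chat request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'
```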
If you have models which are smaller than your VRAM (minus about 1 GiB for other allocations) then you can keep `-ngl 99` in the server config to load all layers on the GPU.
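For reference, a minimal config.yaml sketch for a model that fits entirely in VRAM; the model name, GGUF path, and the assumption that `llama-server` is on the container's PATH are placeholders, so check the llama-swap README for the exact schema:

```yaml
models:
  "gemma-2-2b":
    cmd: |
      llama-server --port ${PORT}
      -m /models/gemma-2-2b-it-Q4_K_M.gguf
      -ngl 99
```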
If you are running a model larger than your GPU's VRAM, then use the llama.cpp log output from llama-swap (http://localhost:8080/logs) and the `radeontop` command-line program to load as many layers as you can with the llama.cpp `-ngl` option without overflowing VRAM. The remaining layers will run on the CPU.
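For instance, keep `radeontop` running in a second terminal while the model loads and adjust `-ngl` until VRAM is nearly full without overflowing:

```sh
# Watch the VRAM bar while llama.cpp loads the model;
# back off -ngl if it hits the limit
radeontop
```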
For example, I have a Radeon RX 5600 XT with 6 GB of VRAM. I can fit small models like Gemma-2-2B-it or Phi-3.5-mini-instruct (4B) entirely on the GPU. For a larger model like Llama-3.1-8B-Q6KL, I can only load 24 of the model's 33 layers, so I use `-ngl 24`.
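In config.yaml terms that just means a lower layer count for the bigger model's entry (shown here as its own minimal sketch; in practice it sits alongside the other entries under the same `models:` key, and the name and filename are placeholders):

```yaml
models:
  "llama-3.1-8b":
    cmd: |
      llama-server --port ${PORT}
      -m /models/Llama-3.1-8B-Q6_K_L.gguf
      -ngl 24
```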