rocswap = llama.cpp + ROCm + llama-swap

Lets you run llama.cpp with ROCm acceleration on most Radeon RX Vega/5000/6000/7000 GPUs, even those not on AMD's official list of ROCm-supported GPUs.

Contents

This is a Linux container which builds llama.cpp with ROCm support and uses llama-swap to serve models.

I just put this together from other people's work listed below.

Reddit comment that Debian Bookworm Backports kernel contains the ROCm kernel interface, and Debian Trixie contains the userspace:

Instructions on Debian-AI list to compile llama.cpp with ROCm:

llama.cpp - efficient CPU and GPU LLM inference server:

llama-swap - OpenAI-compatible server to serve models and swap/proxy inference servers:

Requirements

Linux with the amdgpu driver ROCm interface enabled. Distros with this already included by default are: Debian Bookworm Backports, Debian Trixie/Sid, and Ubuntu 24.04. For other distros you might need to use the amdgpu-install script from the AMD website.

Make sure your GPU is on the Debian ROCm supported GPU list in Trixie/Sid. The Bookworm Backports kernel has the same support level as Trixie.

Add your user to the video and render groups on your system (this needs root): sudo usermod -aG video,render "$USER". Log out and log in again, then confirm the new membership with the groups command.
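The group check can also be scripted. The following is a minimal sketch, not part of this repository: `missing_groups` is a hypothetical helper that takes a space-separated group list (such as the output of `id -nG`) and prints whichever of video and render are absent.

```shell
# Hypothetical helper: print the required groups that are missing
# from the space-separated group list passed as $1.
missing_groups() {
  have=" $1 "
  missing=""
  for g in video render; do
    case "$have" in
      *" $g "*) ;;                      # group present, nothing to do
      *) missing="$missing $g" ;;       # group absent, remember it
    esac
  done
  echo "${missing# }"
}

# Check the current login session; empty output means both groups are active.
missing_groups "$(id -nG)"
```

If this still prints a group after running usermod, the session probably predates the change, so log out and back in.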

Instructions

Look up your GPU in the LLVM amdgpu targets, or look at your GPU's code name in rocminfo, and replace my gfx1010 in the Containerfile with your GPU's architecture name.
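One way to pull the architecture name out of rocminfo is to filter its output for gfx identifiers. This is a sketch under assumptions: `gfx_targets` is a hypothetical helper name, rocminfo must be installed on the host, and the pattern simply matches anything that looks like `gfx` followed by hex digits.

```shell
# Hypothetical helper: read rocminfo-style text on stdin and print
# each distinct gfx architecture name once.
gfx_targets() {
  grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Only query the host if rocminfo is actually available.
if command -v rocminfo >/dev/null 2>&1; then
  rocminfo | gfx_targets
fi
```

If it prints, say, gfx1030, then something like `sed -i 's/gfx1010/gfx1030/g' Containerfile` should make the substitution, assuming gfx1010 appears in the Containerfile only as the build target.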

Build the container:

podman build . -t rocswap

Deploy the container:

podman run -dit -p 8080:8080 --name rocswap \
  -v ./models:/models \
  -v ./config.yaml:/config.yaml \
  --device /dev/dri --device /dev/kfd \
  --group-add keep-groups \
  --user 1000:1000 \
  rocswap

If a model is smaller than your VRAM (minus about 1 GiB for other allocations), you can keep -ngl 99 in the server config to load all of its layers on the GPU.

If a model is larger than your GPU's VRAM, watch the llama.cpp log output in llama-swap (http://localhost:8080/logs) and the radeontop command-line program, and raise the llama.cpp -ngl option until you have loaded as many layers as fit without overflowing VRAM. The remaining layers will run on the CPU.

For example, I have a Radeon RX 5600 XT with 6 GB of VRAM. Small models like Gemma-2-2B-it or Phi-3.5-mini-instruct (4B) fit entirely on the GPU. For a larger model like Llama-3.1-8B-Q6KL, only 24 of the model's 33 layers fit, so I use -ngl 24.
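A first guess for -ngl can be computed before fine-tuning with the logs and radeontop. The sketch below is a rough estimate only: `estimate_ngl` is a hypothetical helper, it assumes every layer is the same size, it ignores the KV cache and compute buffers, and the 6850 MiB model size in the usage line is an assumed figure for illustration rather than a measured one.

```shell
# Hypothetical helper: estimate how many layers fit in VRAM.
# Arguments (sizes in MiB):
#   $1 model file size   $2 total layer count
#   $3 VRAM size         $4 VRAM to keep free for other allocations
estimate_ngl() {
  model_mib=$1; layers=$2; vram_mib=$3; reserve_mib=$4
  usable=$((vram_mib - reserve_mib))
  if [ "$usable" -le 0 ]; then
    echo 0                              # nothing left for the model
    return
  fi
  # Integer arithmetic: the fraction of the model that fits, times the layer count.
  ngl=$((layers * usable / model_mib))
  if [ "$ngl" -gt "$layers" ]; then
    ngl=$layers                         # the whole model fits
  fi
  echo "$ngl"
}

# 6 GiB card, 1 GiB kept free, 33 layers, model assumed to be about 6850 MiB
estimate_ngl 6850 33 6144 1024   # prints 24
```

Treat the result as a starting value: if the log shows allocation failures or radeontop shows VRAM pinned at its limit, lower -ngl by a layer or two.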

License