
6. FAQ & Issues

AlpinDale edited this page Feb 3, 2024 · 2 revisions

Why is Aphrodite Engine using too much VRAM?

By default, Aphrodite uses 90% of your GPU's (or GPUs') VRAM. To lower this limit, set -gmu 0.5 or gpu_memory_utilization=0.5 to use 50% instead.
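As a rough illustration of what the fraction means, the sketch below computes the VRAM budget for one GPU. The function name and the 24 GiB card are hypothetical, and it assumes the fraction is applied to each GPU's total VRAM.

```python
def vram_budget_gib(total_vram_gib: float, gpu_memory_utilization: float = 0.9) -> float:
    """Return how much VRAM (in GiB) the engine may claim on one GPU.

    Assumption: the utilization fraction is applied to the GPU's total VRAM.
    """
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_vram_gib * gpu_memory_utilization


# Hypothetical 24 GiB card: compare the default (0.9) with -gmu 0.5.
for fraction in (0.9, 0.5):
    budget = vram_budget_gib(24, fraction)
    print(f"{fraction:.0%} of 24 GiB -> {budget:.1f} GiB")
```

Whatever is left outside this budget stays available to other processes on the same GPU.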

I keep running out of memory!

If you're using a quantized model (GPTQ, AWQ, etc.), make sure you specify the quantization method at launch with the -q flag, e.g. -q gptq or quantization=gptq.

If you're running a Mistral or Mixtral-based model, try limiting the model's max context length (it launches with 32k by default) using the --max-model-len option, e.g. --max-model-len 8192 or max_model_len=8192.

If the issue persists, try increasing or decreasing the --gpu-memory-utilization value. You can also use the FP8 KV cache option with --kv-cache-dtype fp8 to save even more memory. Note that this does not require an H100/4090 GPU.
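To see why a shorter context and an FP8 KV cache help, here is a back-of-the-envelope estimate of the KV cache needed to hold one full-length sequence. It is a sketch under stated assumptions: the helper name is hypothetical, the layer, KV-head, and head-dimension values are those of Mistral 7B, and it uses the standard layout of one key and one value vector per layer per token. It does not model how Aphrodite actually allocates cache blocks.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_element: int) -> int:
    """Estimate the KV cache size for `num_tokens` tokens of one sequence."""
    # Factor 2: each layer stores one key vector and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element * num_tokens


# Mistral 7B architecture values (used here purely as an example).
MISTRAL_7B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for max_model_len in (32768, 8192):
    for dtype_name, width in (("fp16", 2), ("fp8", 1)):
        size = kv_cache_bytes(max_model_len, bytes_per_element=width, **MISTRAL_7B)
        print(f"max_model_len={max_model_len:>5} {dtype_name}: {size / 2**30:.2f} GiB")
```

Under these assumptions the estimate scales linearly with the context length and halves when the cache element width drops from two bytes to one.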

How do I increase the model's context length?

Aphrodite supports automatic RoPE extension. Simply specify your desired context length with the --max-model-len option.

Is CUDA 12 supported?

If you install via the pip package, then yes. If building from source, CUDA 11.8 or lower is required at the moment.

I can only send 256 requests at once!

Try increasing --max-num-seqs; the default value is 256.

Windows support?

Windows users will have to rely on WSL for now. Native Windows support is currently being worked on.

Which AMD GPUs work?

Due to limitations from the upstream ROCm fork of Flash Attention, only a handful of datacenter-grade AMD GPUs are supported. Work is being done to support consumer AMD hardware.

Why does it take so long to start the engine on multiple GPUs?

There are multiple contributing factors:

  • Initializing a distributed environment involves setting up communication between the GPUs, which can take some time, depending on how the GPUs communicate with each other (NVLink, PCIe, etc).
  • Aphrodite Engine profiles memory usage when initializing the KV cache. This involves running a forward pass with dummy inputs, which can be time-consuming, especially in distributed environments.
  • Allocating memory blocks on the GPU and CPU can take a significant amount of time, especially when the number of blocks is large.

The increase in init time isn't necessarily linear with the number of GPUs because each additional GPU adds overhead for communication and synchronization between GPUs. The memory profiling and cache initialization steps are performed for each GPU, which can lead to an increase in init time.

I'm getting CUDA mismatch errors!

You might be seeing an error similar to this:

The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.8). Please make sure to use the same CUDA versions.

This is normally due to your environment referring to the global installation of CUDA and not the one in your current env. Run which nvcc and note down the output. For example, if your output is /home/anon/miniconda3/envs/aphrodite/bin/nvcc, run this command:

export CUDA_HOME=/home/anon/miniconda3/envs/aphrodite

Don't forget to replace /home/anon with the actual path from the which nvcc output!

Then run the installation command again.
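If you'd rather not edit the path by hand, the value can be derived from the nvcc location. This is a minimal sketch, assuming nvcc lives in a bin directory directly under the CUDA root, as in the example path above; both function names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Optional


def cuda_home_from_nvcc(nvcc_path: str) -> str:
    """Map .../bin/nvcc to the directory that contains bin/."""
    return str(Path(nvcc_path).parent.parent)


def detect_cuda_home() -> Optional[str]:
    """Look up nvcc on PATH (like `which nvcc`) and derive CUDA_HOME from it."""
    nvcc = shutil.which("nvcc")
    return cuda_home_from_nvcc(nvcc) if nvcc else None


home = detect_cuda_home()
if home is None:
    print("nvcc not found on PATH")
else:
    # Print the line to paste into your shell before reinstalling.
    print(f"export CUDA_HOME={home}")
```

Run it inside the activated environment so that the nvcc found on PATH is the one belonging to that env.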

No NVML device handle. Skipping nvlink detection.

On some GPU configurations, you may see this error:

ncclInternalError: Internal check failed.
Last error:
No NVML device handle. Skipping nvlink detection.

This happens if you're doing tensor parallelism (multi-GPU) on NVLinked NVIDIA GPUs that don't support P2P. Run this command before starting the server:

export NCCL_P2P_DISABLE=1

Alternatively, you can prepend NCCL_P2P_DISABLE=1 to your server launch command.
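If you start the server from a Python script instead of a shell, the same variable can be set for the child process only. This is a sketch, not part of Aphrodite: both helper names are hypothetical, and the command you pass in is whatever launch command you already use.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional, Sequence


def env_with_p2p_disabled(base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy an environment and set NCCL_P2P_DISABLE=1 in the copy."""
    env = dict(os.environ if base is None else base)
    env["NCCL_P2P_DISABLE"] = "1"
    return env


def run_with_p2p_disabled(command: Sequence[str]) -> int:
    """Run `command` with NCCL peer-to-peer disabled and return its exit code."""
    return subprocess.run(list(command), env=env_with_p2p_disabled()).returncode


# Stand-in for the real launch command: a child process that prints
# the value of the variable it inherited.
demo = [sys.executable, "-c", "import os; print(os.environ['NCCL_P2P_DISABLE'])"]
run_with_p2p_disabled(demo)
```

Because only the copied environment is modified, the variable does not leak into the parent process.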

Prometheus endpoint doesn't work!

This is likely due to Docker container port forwarding issues. We're trying to have the container scrape from the host, which is an unusual use-case and not easily fixed. An easy solution is to use cloudflared to forward your local port to a public URL.

Download the cloudflared binary and run the following:

chmod +x cloudflared-linux-amd64

./cloudflared-linux-amd64 tunnel --url localhost:2242

Then, edit prometheus.yaml, and make the following changes:

  scrape_interval: 1s
  evaluation_interval: 1s

  - job_name: aphrodite-engine
    metrics_path: /metrics
+   scheme: https
      - targets:
-          - 'host.docker.internal:2242'
+         - ''

Replace the empty target with the address of your cloudflared tunnel, then launch the container with docker compose up.
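Prometheus targets are written as host[:port] without a scheme, which is why the scheme is set separately in the config above. The sketch below turns a tunnel URL into a value suitable for the targets list. The function name and the example hostname are hypothetical; use the URL that cloudflared prints for your own tunnel.

```python
from urllib.parse import urlparse


def prometheus_target(tunnel_url: str) -> str:
    """Return the host[:port] part of an https tunnel URL."""
    parsed = urlparse(tunnel_url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"expected an https:// URL, got {tunnel_url!r}")
    return parsed.netloc


# Hypothetical tunnel URL, for illustration only.
print(prometheus_target("https://example.trycloudflare.com"))
```

The printed value is what goes between the quotes in the targets entry.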