Deploy vLLM on Koyeb
Learn more about Koyeb · Explore the documentation · Discover our tutorials
Koyeb is a developer-friendly serverless platform to deploy apps globally. No ops, servers, or infrastructure management required.
This repository is designed to show how to deploy a vLLM instance to Koyeb. The Dockerfile allows for configuration through environment variables to make deployment and configuration more straightforward. By default, the image deploys the `meta-llama/Meta-Llama-3.1-8B-Instruct` model, but this is configurable using the `MODEL_NAME` environment variable.
Follow the steps below to deploy a vLLM instance to your Koyeb account.
To use this repository, you need:
- A Koyeb account to build the `Dockerfile` and deploy it to the platform. If you don't already have an account, you can sign up for free.
- Access to GPU Instances on Koyeb. Join the preview today to gain access.
- A Hugging Face account with a read-only API token. You will use this to fetch the models that vLLM will run. You may also need to accept the terms and conditions or usage license agreements associated with the models you intend to use. In some cases, you may need to request access to the model from the model owners on Hugging Face. For this guide, make sure you have accepted any terms required for the `meta-llama/Meta-Llama-3.1-8B-Instruct` model (a quick way to verify access is sketched below).
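If you want to confirm that your token can read the gated model before deploying, you can run a quick local check with the `huggingface_hub` library. This is an optional sketch, not part of the deployment itself; it assumes `huggingface_hub` is installed and that your token is exported as `HF_TOKEN` in your local shell:

```python
# Optional sanity check: can this token read the gated model?
# Assumes `pip install huggingface_hub` and HF_TOKEN set locally.
import os

from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError, HfHubHTTPError

MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

api = HfApi(token=os.environ["HF_TOKEN"])
try:
    info = api.model_info(MODEL)
    print(f"OK: token can read {info.id}")
except GatedRepoError:
    print(f"Access denied: accept the license for {MODEL} on Hugging Face first")
except HfHubHTTPError as err:
    print(f"Request failed: {err}")
```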
The fastest way to deploy a vLLM instance is to click the Deploy to Koyeb button below.
Clicking on this button brings you to the Koyeb App creation page with most of the settings pre-configured to launch this application. You will need to configure the following environment variables:
- `HF_TOKEN`: Set this to your Hugging Face read-only API token.
- `MODEL_NAME`: Set this to the name of the model you wish to use, as given on the Hugging Face site. You can check which models vLLM supports to find out more. Click the copy icon next to the model name on the Hugging Face page to copy the appropriate value. If not provided, the `meta-llama/Meta-Llama-3.1-8B-Instruct` model will be deployed.
- `REVISION`: Set this to the model revision you wish to use. You can find available revisions in a drop-down menu on the Files and versions tab of the Hugging Face model page. If not provided, the default revision for the given model will be deployed.
- `VLLM_API_KEY`: This defines an authorization token that must be provided when querying the API. If not provided, the API will accept unauthenticated queries (see the example after this list).
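When `VLLM_API_KEY` is set, clients must send it as a Bearer token. Since vLLM exposes an OpenAI-compatible API, one way to do this is with the official `openai` Python client. The sketch below assumes `openai>=1.0`, a placeholder service URL, and the default model from this guide:

```python
# Minimal sketch: querying the deployed instance with the `openai` client.
# Replace base_url with your actual Koyeb service subdomain.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-service-your-org.koyeb.app/v1",  # placeholder URL
    api_key="your-vllm-api-key",  # the value you set for VLLM_API_KEY
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```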
Additionally, open the Health checks section and set the Grace period to 300 seconds to allow time for vLLM to fetch the model.
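During that grace period, the server will not answer requests yet. If you want to script around this, a simple approach is to poll `/v1/models` until it responds successfully. This sketch assumes the `requests` library and a placeholder service URL:

```python
# Poll the service until vLLM has downloaded and loaded the model.
# Assumes `pip install requests`. If you set VLLM_API_KEY, also send
# an Authorization header as shown in the example above.
import time

import requests

BASE_URL = "https://your-service-your-org.koyeb.app"  # placeholder

for _ in range(60):  # up to ~5 minutes, matching the 300 s grace period
    try:
        if requests.get(f"{BASE_URL}/v1/models", timeout=5).status_code == 200:
            print("vLLM is ready")
            break
    except requests.RequestException:
        pass  # service still starting; ignore connection errors
    time.sleep(5)
else:
    print("Timed out waiting for vLLM to become ready")
```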
To modify this application example, you will need to fork this repository. Check out the fork and deploy instructions.
If you want to customize and enhance this application, you need to fork this repository.
If you used the Deploy to Koyeb button, you can simply link your service to your forked repository so you can push changes. Alternatively, you can manually create the application as described below.
On the Koyeb Control Panel, on the Overview tab, click the Create Web Service button to begin.
1. Select GitHub as the deployment method.

2. Choose the repository containing your application code.

3. Expand the Environment variables section and click Bulk edit to configure new environment variables. Paste the following variable definitions in the box:

    ```
    HF_TOKEN=
    MODEL_NAME=
    REVISION=
    VLLM_API_KEY=
    ```

    Fill out the values as described in the previous section.

4. In the Instance section, select the GPU category and choose RTX-4000-SFF-ADA.

5. In the Health checks section, set the Grace period to 300 seconds. This will provide time for vLLM to download the appropriate model from Hugging Face and initialize the server.

6. Click Deploy.
The repository will be pulled, built, and deployed on Koyeb. Once the deployment is complete, it will be accessible using the Koyeb subdomain for your service.
Use the following paths to interact with your instance:
- `/v1/models`: to access the list of models served by the vLLM instance.
- `/v1/completions`: to access the completions API.
- `/v1/chat/completions`: to access the chat API.
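For example, you can exercise these endpoints with a few lines of Python. The sketch below uses the `requests` library; the service URL is a placeholder, and the `Authorization` header is only needed if you set `VLLM_API_KEY`:

```python
# Query the deployed vLLM instance's OpenAI-compatible endpoints.
# Assumes `pip install requests`; replace the placeholders with your values.
import requests

BASE_URL = "https://your-service-your-org.koyeb.app"  # placeholder subdomain
HEADERS = {"Authorization": "Bearer your-vllm-api-key"}  # omit if no VLLM_API_KEY

# List the models served by this instance.
models = requests.get(f"{BASE_URL}/v1/models", headers=HEADERS, timeout=30)
print(models.json())

# Request a plain completion.
completion = requests.post(
    f"{BASE_URL}/v1/completions",
    headers=HEADERS,
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "Koyeb is",
        "max_tokens": 32,
    },
    timeout=120,
)
print(completion.json()["choices"][0]["text"])
```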
If you have any questions, ideas, or suggestions regarding this application sample, feel free to open an issue or fork this repository and open a pull request.