This guide provides detailed steps for deploying and serving LLMs on Intel CPU/GPU/Gaudi.
Please follow setup.md to set up the environment first.
We provide preconfigured yaml files in `inference/models` for popular open source models. You can customize a few configurations, such as the resources used for serving.
To deploy on CPU, please make sure `device` is set to CPU and `cpus_per_worker` is set to a correct number.

```yaml
cpus_per_worker: 24
device: CPU
```
To deploy on GPU, please make sure `device` is set to GPU and `gpus_per_worker` is set to 1.

```yaml
gpus_per_worker: 1
device: GPU
```
To deploy on Gaudi, please make sure `device` is set to HPU and `hpus_per_worker` is set to 1.

```yaml
hpus_per_worker: 1
device: HPU
```
LLM-on-Ray also supports serving with DeepSpeed for AutoTP and with BigDL-LLM for INT4/FP4/INT8/FP8 quantization to reduce latency. You can follow the corresponding documents to enable them.
We support three methods of specifying the models to be served, listed below in order of priority.
- Use an inference configuration file if `--config_file` is set.

  ```bash
  python inference/serve.py --config_file inference/models/gpt2.yaml
  ```
- Use the relevant configuration parameters if `--model_id_or_path` is set.

  ```bash
  python inference/serve.py --model_id_or_path gpt2 [--tokenizer_id_or_path gpt2 --port 8000 --route_prefix ...]
  ```
- If neither `--config_file` nor `--model_id_or_path` is set, all pre-defined models in inference/models/*.yaml are served, or a subset of them if `--models` is set.

  ```bash
  python inference/serve.py --models gpt2 gpt-j-6b
  ```
To deploy your model, execute the following command with the model's configuration file. This will create an OpenAI-compatible API (OpenAI API Reference) for serving.

```bash
python inference/serve.py --config_file <path to the conf file>
```
To deploy and serve multiple models concurrently, place all models' configuration files under `inference/models` and run `python inference/serve.py` directly, without passing any conf file.
After deploying the model, you can access and test it in several ways:
```bash
# using curl
export ENDPOINT_URL=http://localhost:8000/v1
export MODEL_NAME="gpt2"
curl $ENDPOINT_URL/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_NAME"'",
    "messages": [{"role": "assistant", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
```
```bash
# using requests library
python examples/inference/api_server_openai/query_http_requests.py
```
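For reference, the request sent by such a script looks roughly like the sketch below. The endpoint URL and the `gpt2` model name mirror the examples above and are assumptions; adjust them to your deployment.

```python
# Minimal sketch of querying the OpenAI-compatible endpoint with requests.
# The URL and model name follow the examples above and are assumptions.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": "gpt2",
        "messages": [
            {"role": "assistant", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ],
        "temperature": 0.7,
    },
)
response.raise_for_status()
# The response follows the OpenAI chat completions format.
print(response.json()["choices"][0]["message"]["content"])
```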
```bash
# using OpenAI SDK
# please install openai in the current env by running: pip install "openai>=1.0"
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY="not_a_real_key"
python examples/inference/api_server_openai/query_openai_sdk.py
```
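Under the hood, a query with the openai>=1.0 SDK looks roughly like the sketch below. The base URL, API key, and `gpt2` model name mirror the examples above and are assumptions; here they are passed to the client explicitly rather than read from the environment.

```python
# Minimal sketch of querying the endpoint with the openai>=1.0 SDK.
# base_url, api_key, and the model name follow the examples above and are
# assumptions; adjust them to your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not_a_real_key",
)
chat_completion = client.chat.completions.create(
    model="gpt2",
    messages=[
        {"role": "assistant", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
)
print(chat_completion.choices[0].message.content)
```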
Alternatively, deploy with the `--simple` flag to create a simple endpoint for serving according to the `port` and `route_prefix` parameters in the conf file, for example: http://127.0.0.1:8000/gpt2.

```bash
python inference/serve.py --config_file <path to the conf file> --simple
```
After deploying the model endpoint, you can access and test it by using the script below:

```bash
python examples/inference/api_server_simple/query_single.py --model_endpoint <the model endpoint URL>
```
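If you want to query the simple endpoint directly, a minimal sketch is shown below. It assumes the endpoint accepts a JSON body with a `text` prompt field; check query_single.py for the exact request schema used by your deployment.

```python
# Minimal sketch of querying the simple endpoint directly over HTTP.
# The endpoint URL matches the example above; the "text" payload field is an
# assumption -- see query_single.py for the exact request schema.
import requests

model_endpoint = "http://127.0.0.1:8000/gpt2"
response = requests.post(model_endpoint, json={"text": "Hello!"})
response.raise_for_status()
print(response.text)
```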