Feature/add vllm deploy (#82)
* support qlora

* upload dummy conversation data

* delete doc and docker

* update pyproject pip install package

* continue cleaning

* delete more files

* delete a format

* add llm_deploy

* add testing scripts

* update deployment readme

* update readme and fix some bug

* finalize the inference and deployment based on vllm

* Add babel deployment tutorial md

* add minor suggestions

* delete qlora_train.sh

* Delete duplicate data file

* Add tutorial for ssh tunnel

* Add fastchat api server tutorial

* Minor modification on the deployment tutorial

---------

Co-authored-by: lwaekfjlk <1125027232@qq.com>
ruiyiw and lwaekfjlk authored Nov 6, 2023
1 parent 2fbcb77 commit aea8ad6
92 changes: 68 additions & 24 deletions llm_deploy/README.md
Go to the vllm directory and run `pip install -e .`.
Note https://github.com/vllm-project/vllm/issues/1283: if you hit a CUDA version error, pin the PyTorch requirement to "== 2.0.1" in the config file and make sure the installed PyTorch version matches.
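A minimal sketch of that pin is shown below; the exact requirement file and dependency line depend on your vLLM checkout, so treat the edit target as an assumption:
```bash
# Sketch: install the matching PyTorch version first (assumed pin from the issue above).
pip install "torch==2.0.1"
# Then edit vllm's pyproject.toml (or requirements file) so its torch requirement
# reads "== 2.0.1" -- the exact line may differ in your checkout -- and reinstall:
pip install -e .
```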



## Setting up Babel server
### Login with SSH key
Add public ed25519 key to server
```bash
ssh-copy-id -i ~/.ssh/id_ed25519.pub <username>@<mycluster>
```
Configure the SSH config file (`~/.ssh/config`)
```bash
Host <mycluster>
    HostName <mycluster>
    User <username>
    IdentityFile ~/.ssh/id_ed25519
```
Log in to Babel with the SSH key
```bash
ssh <mycluster>
```

### Connecting to a compute node
Jump from the login node to a compute node
```bash
srun --pty bash
```
Check that you can access the /data/ folder
```bash
cd /data/datasets/
```

### Config environment on the compute node
Install miniconda
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda create --name myenv
conda activate myenv
# conda deactivate
```
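If you want to pin the Python version when creating the environment, conda accepts it directly; the version number below is just an example:
```bash
# Create the environment with an explicit Python version (example value).
conda create --name myenv python=3.10
conda activate myenv
```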
Install the vLLM package
```bash
conda install pip
pip install vllm
```
Install the FastChat package
```bash
conda install pip
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip3 install --upgrade pip
pip3 install "fschat[model_worker,webui]"
```
Submit a GPU request and open an interactive terminal
```bash
srun --gres=gpu:1 --time=1-00:00:00 --mem=80G --pty $SHELL
conda activate myenv
```
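If you need more GPUs or a specific GPU type, the `--gres` value can be extended; the type label below is a hypothetical placeholder and depends on what the cluster actually exposes:
```bash
# Hypothetical example: request two GPUs of a specific type (label is cluster-specific).
srun --gres=gpu:A6000:2 --time=1-00:00:00 --mem=80G --pty $SHELL
```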
Some useful commands for checking GPU jobs
```bash
# check slurm status
squeue -l
# check gpu status
nvidia-smi
# check gpu usage
pip install gpustat
watch -n 1 gpustat
# cancel a slurm job
scancel <job_id>
# connect to compute node directly
ssh -J babel babel-x-xx
```
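For more detail on a single job than `squeue -l` shows, standard Slurm commands such as `scontrol` and `sacct` can help (a sketch; available accounting fields may vary with your Slurm setup):
```bash
# Show the full record for a running or pending job
scontrol show job <job_id>
# Show accounting info (state, elapsed time, peak memory) for finished jobs
sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS
```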

### Install cuda-toolkit (optional)
Due to the vLLM issue https://github.com/vllm-project/vllm/issues/1283, we need cuda-toolkit 11.7.0, which is compatible with PyTorch 2.0.1.
Install cuda-toolkit 11.7.0 in the conda environment
```bash
conda install -c "nvidia/label/cuda-11.7.0" cuda-toolkit
```
Check cuda-toolkit version
```bash
nvcc -V
```

## Deploy models on Babel via FastChat API server
Run the following commands, each starting one FastChat component, in three separate interactive terminal windows:
```bash
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-path model-checkpoint
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
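All three processes must stay up at the same time. If you prefer not to keep three terminals open, one option (a sketch using standard shell backgrounding, not part of the original instructions) is:
```bash
# Run each FastChat process in the background and log its output (sketch).
python3 -m fastchat.serve.controller > controller.log 2>&1 &
python3 -m fastchat.serve.model_worker --model-path model-checkpoint > worker.log 2>&1 &
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000 > api.log 2>&1 &
```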
Call the model checkpoint API
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model-checkpoint",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
*Sample output:*
```JSON
{"id":"cmpl-GGvKBiZFdFLzPq2HdtuxbC","object":"text_completion","created":1698692212,"model":"checkpoint-4525","choices":[{"index":0,"text":"city that is known for its icon","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"total_tokens":11,"completion_tokens":6}}
```
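The FastChat OpenAI-compatible server also exposes a chat completions endpoint; a hedged example follows (the model name must match what the model worker registered):
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-checkpoint",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32,
    "temperature": 0
  }'
```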

## Deploy models on Babel via vLLM API server
Start a vLLM server with the model checkpoint
```bash
python -m vllm.entrypoints.openai.api_server --model model_checkpoint/
```
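The server listens on port 8000 by default; host, port, and (for multi-GPU checkpoints) tensor parallelism can be set explicitly. A sketch, with example flag values:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model model_checkpoint/ \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1
```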
Call the models API to confirm the server is up
```bash
curl http://localhost:8000/v1/models
```
*Sample output:*
```JSON
{"object":"list","data":[{"id":"Mistral-7B-Instruct-v0.1/","object":"model","created":1697599903,"owned_by":"vllm","root":"Mistral-7B-Instruct-v0.1/","parent":null,"permission":[{"id":"modelperm-d415ecf6362a4f818090eb6428e0cac9","object":"model_permission","created":1697599903,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
```
Run inference through the completions API
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model_checkpoint/",
"model": "model_checkpoint",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
```
*Sample output:*
```JSON
{"id":"cmpl-bf7552957a8a4bd89186051c40c52de4","object":"text_completion","created":3600699,"model":"Mistral-7B-Instruct-v0.1/","choices":[{"index":0,"text":" city that is known for its icon","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}
```

## Access deployed Babel server on a local machine
Construct an SSH tunnel between the Babel login node and the compute node hosting the model
```bash
ssh -N -L 7662:localhost:8000 username@babel-x-xx
```
The above command creates a localhost:7662 server on the Babel login node which connects to localhost:8000 on the compute node.

Construct an SSH tunnel between your local machine and the Babel login node
```bash
ssh -N -L 8001:localhost:7662 username@<mycluster>
```
The above command creates a localhost:8001 server on your local machine which connects to localhost:7662 on the Babel login node.
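Alternatively, the two tunnels can be collapsed into a single command by jumping through the login node (a sketch using standard `ssh -J`; host names are the same placeholders as above):
```bash
# Forward local port 8001 straight to the compute node's port 8000 via the login node.
ssh -N -J <mycluster> -L 8001:localhost:8000 username@babel-x-xx
```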

Call the hosted model from your local machine
```bash
curl http://localhost:8001/v1/models
```
If the above command runs successfully, you should be able to use the REST API from your local machine.
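For example, the same completion request shown earlier can now be sent to the forwarded port (assuming the model name from the vLLM section above):
```bash
curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model_checkpoint",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```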

(Optional) If building the SSH tunnel fails, add `-v` to the ssh command to see what went wrong.




## Useful resource links for Babel
1. https://hpc.lti.cs.cmu.edu/wiki/index.php?title=BABEL#Cluster_Architecture
2. https://hpc.lti.cs.cmu.edu/wiki/index.php?title=VSCode
3. https://hpc.lti.cs.cmu.edu/wiki/index.php?title=Training_Material
