Commit
* support qlora
* upload dummy conversation data
* delete doc and docker
* update pyproject pip install package
* continue cleaning
* delete more files
* delete a format
* add llm_deploy
* add testing scripts
* update deployment readme
* update readme and fix some bug
* finalize the inference and deployment based on vllm
Showing 5 changed files with 33 additions and 2 deletions.
@@ -0,0 +1,5 @@
We need to use an unmerged branch to support deploying a LoRA-finetuned model (the forked repo is https://github.com/troph-team/vllm.git).

Go to the vllm dir and run pip install -e .

Note https://github.com/vllm-project/vllm/issues/1283: if you hit a CUDA version error, change the PyTorch requirement in the config file to "== 2.0.1".
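A minimal sketch of the setup steps described above, assuming the fork's default branch carries the unmerged LoRA support (the exact branch name is not stated in this note):

```bash
# Clone the fork that carries the unmerged LoRA-deployment support
git clone https://github.com/troph-team/vllm.git
cd vllm

# Editable install so the local changes are picked up
pip install -e .

# Workaround for https://github.com/vllm-project/vllm/issues/1283:
# if a CUDA version error appears, pin torch to 2.0.1
# (and adjust the "== 2.0.1" requirement in the config file as noted above)
pip install "torch==2.0.1"
```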
@@ -0,0 +1 @@
vllm
@@ -0,0 +1 @@
python -m vllm.entrypoints.openai.api_server --model ../llm_ft/vicuna-7b-1.5
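Once the server is up, it exposes the OpenAI-compatible completions API; a quick smoke test might look like the following (this assumes the default port 8000, and the model name must match the --model path passed above):

```bash
# Query the OpenAI-compatible endpoint started by the command above
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "../llm_ft/vicuna-7b-1.5",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0
      }'
```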
@@ -0,0 +1,23 @@
from vllm import LLM, SamplingParams
from vllm.model_executor.adapters import lora

# Create an LLM; adjust gpu_memory_utilization based on our needs
llm = LLM(model="../llm_ft/vicuna-7b-1.5", gpu_memory_utilization=0.5)

# Add the LoRA adapter on top of the base model
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, "../llm_ft/vicuna_checkpoints/checkpoint-1200")

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

# Greedy decoding: temperature=0, no top-k filtering
sampling_params = SamplingParams(temperature=0, top_k=-1)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")