Use Case: Enhancing LLM Serving with Torch Compiled RAG on AWS Graviton (#3276)

* RAG based LLM usecase
* RAG based LLM usecase
* Changes for deploying RAG
* Updated README
* Added main blog
* Added main blog assets
* Added main blog assets
* Added use case to index html
* Added benchmark config
* Minor edits to README
* Added new MD for Gen AI usecases
* Added link to GV3 tutorial
* Addressed review comments
* Update examples/usecases/RAG_based_LLM_serving/README.md Co-authored-by: Matthias Reso <13337103+mreso@users.noreply.github.com>
* Update examples/usecases/RAG_based_LLM_serving/README.md Co-authored-by: Matthias Reso <13337103+mreso@users.noreply.github.com>
* Addressed review comments

---------

Co-authored-by: Matthias Reso <13337103+mreso@users.noreply.github.com>
Showing 20 changed files with 766 additions and 12 deletions.
@@ -0,0 +1,7 @@
# TorchServe GenAI use cases and showcase

This document showcases use cases for GenAI deployments with TorchServe.

## [Enhancing LLM Serving with Torch Compiled RAG on AWS Graviton](https://pytorch.org/serve/enhancing_llm_serving_compile_rag.html)

In this blog, we show how to deploy a RAG endpoint using TorchServe, increase throughput using `torch.compile`, and improve the response generated by the Llama endpoint. We also show how the RAG endpoint can be deployed on CPU using AWS Graviton while the Llama endpoint stays on a GPU. This kind of microservices-based RAG solution uses compute resources efficiently, which can reduce costs for customers.
@@ -0,0 +1,104 @@
# Deploy Llama & RAG using TorchServe

## Contents
* [Deploy Llama](#deploy-llama)
    * [Download Llama](#download-llama)
    * [Generate MAR file](#generate-mar-file)
    * [Add MAR to model store](#add-the-mar-file-to-model-store)
    * [Start TorchServe](#start-torchserve)
    * [Query Llama](#query-llama)
* [Deploy RAG](#deploy-rag)
    * [Download embedding model](#download-embedding-model)
    * [Generate MAR file](#generate-mar-file-1)
    * [Add MAR to model store](#add-the-mar-file-to-model-store-1)
    * [Start TorchServe](#start-torchserve-1)
    * [Query RAG](#query-rag)
* [End-to-End](#rag--llm)

### Deploy Llama

### Download Llama

Follow [these instructions](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) to get permission to download the model.

Log in with a Hugging Face account:
```
huggingface-cli login
# or using an environment variable
huggingface-cli login --token $HUGGINGFACE_TOKEN
```

Then download the model:
```bash
python ../../large_models/Huggingface_accelerate/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3-8B-Instruct
```
The model is saved to `model/models--meta-llama--Meta-Llama-3-8B-Instruct`.
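
The directory name above follows the pattern `models--<org>--<name>`, with the `/` in the model ID replaced by `--`. As a minimal sketch, assuming that pattern holds for other model IDs too, the expected download directory can be derived as follows. The helper name `cache_dir_for` is hypothetical and not part of the example scripts.

```python
from pathlib import Path


def cache_dir_for(model_name: str, model_path: str = "model") -> Path:
    """Return the expected download directory for a Hugging Face model ID.

    Hypothetical helper: it only mirrors the naming shown in this README
    (models--<org>--<name>) and never contacts the Hugging Face Hub.
    """
    return Path(model_path) / ("models--" + model_name.replace("/", "--"))


print(cache_dir_for("meta-llama/Meta-Llama-3-8B-Instruct").as_posix())
# prints "model/models--meta-llama--Meta-Llama-3-8B-Instruct"
```

This can help when filling in `model_path:` in the config file, although the exact sub-path the handler expects should be checked against the downloaded directory itself.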

### Generate MAR file

Add the downloaded path to `model_path:` in `llama-config.yaml` and run the following. Because of `--archive-format no-archive`, the archiver writes a directory named `llama3-8b-instruct` instead of a single `.mar` file.

```
torch-model-archiver --model-name llama3-8b-instruct --version 1.0 --handler ../../large_models/Huggingface_accelerate/llama/custom_handler.py --config-file llama-config.yaml -r ../../large_models/Huggingface_accelerate/llama/requirements.txt --archive-format no-archive
```

### Add the mar file to model store

```bash
# Create the model store and move the archive directory into it
mkdir -p model_store
mv llama3-8b-instruct model_store
# Move the downloaded weights into the archive directory
mv model model_store/llama3-8b-instruct
```
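
After these commands, `model_store/llama3-8b-instruct` should contain both the unpacked archive and the downloaded `model` directory. A small sanity check for that layout could look like the sketch below. The function name and the specific checks are illustrative assumptions, not part of the example.

```python
from pathlib import Path


def check_model_dir(model_store: str, model_name: str) -> list[str]:
    """Return a list of problems found in a no-archive model directory."""
    problems = []
    model_dir = Path(model_store) / model_name
    if not model_dir.is_dir():
        problems.append(f"missing directory: {model_dir}")
        return problems
    # The downloaded weights were moved here with `mv model ...`.
    weights_dir = model_dir / "model"
    if not weights_dir.is_dir():
        problems.append(f"missing downloaded weights: {weights_dir}")
    return problems


# An empty list means the expected directories are in place.
print(check_model_dir("model_store", "llama3-8b-instruct"))
```

The same check applies to the RAG model by passing `"rag"` as the model name.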
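
### Start TorchServe

The `--disable-token-auth` and `--enable-model-api` flags turn off token authorization and enable the model management API, so use them only in a trusted environment.

```bash
torchserve --start --ncs --ts-config ../../large_models/Huggingface_accelerate/llama/config.properties --model-store model_store --models llama3-8b-instruct --disable-token-auth --enable-model-api
```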
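
Before sending queries, it can help to confirm that the server is up. The sketch below polls TorchServe's `/ping` endpoint on the inference API. The address `http://localhost:8080` is TorchServe's default and is an assumption here, since the `config.properties` used above may set a different `inference_address`. The function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def parse_ping(body: bytes) -> bool:
    """Return True if a /ping response body reports a healthy server."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "Healthy"


def is_healthy(base_url: str = "http://localhost:8080", timeout: float = 5.0) -> bool:
    """Call GET <base_url>/ping and report whether TorchServe is healthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
            return parse_ping(resp.read())
    except (urllib.error.URLError, OSError):
        # Server not reachable (yet), or the connection timed out.
        return False
```

Calling `is_healthy()` in a short retry loop after `torchserve --start` is one way to wait for startup. A healthy ping does not guarantee that a large model has finished loading its workers.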
### Query Llama

```bash
python query_llama.py
```
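
The contents of `query_llama.py` are not shown here. As a rough sketch of what such a query involves, the code below posts a prompt to TorchServe's `/predictions/<model_name>` inference endpoint. The default address `http://localhost:8080` and the plain-text request body are assumptions, because the payload format depends on the custom handler. The function names are hypothetical.

```python
import urllib.request


def build_request(prompt: str,
                  model_name: str = "llama3-8b-instruct",
                  base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request for TorchServe's /predictions/<model> endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/predictions/{model_name}",
        # Assumption: the handler accepts the raw prompt text as the body.
        data=prompt.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def query(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the running server and return the raw response text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the server running, `query("What is TorchServe?")` would return whatever text the handler produces. The generous timeout reflects that LLM generation can take a while.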

### Deploy RAG

### Download embedding model

```
python ../../large_models/Huggingface_accelerate/Download_model.py --model_name sentence-transformers/all-mpnet-base-v2
```
The model is downloaded to `model/models--sentence-transformers--all-mpnet-base-v2`.

### Generate MAR file

Add the downloaded path to `model_path:` in `rag-config.yaml` and run the following:
```
torch-model-archiver --model-name rag --version 1.0 --handler rag_handler.py --config-file rag-config.yaml --extra-files="hf_custom_embeddings.py" -r requirements.txt --archive-format no-archive
```

### Add the mar file to model store

```bash
mkdir -p model_store
mv rag model_store
mv model model_store/rag
```

### Start TorchServe
```
torchserve --start --ncs --ts-config config.properties --model-store model_store --models rag --disable-token-auth --enable-model-api
```

### Query RAG

```bash
python query_rag.py
```

### RAG + LLM

Send the query to the RAG endpoint to retrieve context, then send the query together with that context to the Llama endpoint to get a more accurate response.

```bash
python query_rag_llama.py
```
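
The two-step flow described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `query_rag_llama.py`: the prompt template and the function names are hypothetical, and the two endpoints are passed in as plain callables so the logic can be read and tested without running servers.

```python
from typing import Callable


def build_prompt(question: str, context: str) -> str:
    """Combine retrieved context and the user question into one prompt."""
    # Assumption: a simple template; the real script may format this differently.
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def rag_then_llm(question: str,
                 retrieve: Callable[[str], str],
                 generate: Callable[[str], str]) -> str:
    """Fetch context from the RAG endpoint, then ask the LLM with that context."""
    context = retrieve(question)
    return generate(build_prompt(question, context))
```

In a real deployment, `retrieve` would call the RAG endpoint (which can run on CPU) and `generate` would call the Llama endpoint (on GPU). Keeping them as separate callables mirrors the microservices split between the two servers.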