Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. The fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama 2 outperforms other open source language models on many external benchmarks, including reasoning, coding, proficiency, and knowledge tests.
This repo helps customers who want faster response times, measured as time to first byte (TTFB), and therefore lower perceived latency. Streaming support comes from a recent announcement: Sagemaker Real-time Inference now supports response streaming.
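To make the TTFB idea concrete, here is a minimal sketch of how time to first byte and total time could be measured for any streamed response. The helper name `measure_stream` and its `start_stream` and `clock` parameters are hypothetical and not part of this repo; `start_stream` is assumed to be a callable that issues the request and returns an iterable of chunks.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    start_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttfb_seconds, total_seconds, chunk_count) for one streamed call.

    The clock starts before the request is issued, so TTFB includes the
    time spent waiting for the endpoint to produce its first chunk.
    """
    start = clock()
    ttfb = None
    count = 0
    for _chunk in start_stream():
        if ttfb is None:
            # First chunk has arrived: this is the latency the user perceives.
            ttfb = clock() - start
        count += 1
    total = clock() - start
    # An empty stream has no first byte; fall back to the total time.
    return (ttfb if ttfb is not None else total), total, count
```

With streaming, the TTFB value is typically much smaller than the total time, which is why the response feels faster even though the full generation takes just as long.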
The samples cover notebook recipes for implementing response streaming on SageMaker endpoints for Llama 2 LLMs. The models are deployed using two Amazon SageMaker Deep Learning Containers (DLCs): the DLC for Large Model Inference (LMI) and the recently announced Hugging Face DLC powered by Text Generation Inference (HF TGI).
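As a rough illustration of what the inference side can look like, the sketch below reads a streamed response with boto3's `invoke_endpoint_with_response_stream` and reassembles complete lines from the `PayloadPart` chunks. It is a sketch under stated assumptions, not the notebooks' exact code: the helper names (`iter_lines`, `token_text`, `stream_generate`) are hypothetical, and the request payload and the `data:{"token": {"text": ...}}` line format are assumed to follow the HF TGI container's streaming output. The LMI container may emit a different format.

```python
import json
from typing import Iterable, Iterator


def iter_lines(events: Iterable[dict]) -> Iterator[bytes]:
    """Yield complete, non-empty lines from a stream of PayloadPart events.

    A chunk boundary can fall in the middle of a JSON object, so bytes are
    buffered until a newline is seen.
    """
    buffer = b""
    for event in events:
        buffer += event.get("PayloadPart", {}).get("Bytes", b"")
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield line
    if buffer.strip():
        yield buffer  # trailing data without a final newline


def token_text(line: bytes) -> str:
    """Extract the token text from one line (assumed TGI-style format)."""
    start = line.find(b"{")
    if start == -1:
        return ""
    data = json.loads(line[start:].decode("utf-8"))
    return data.get("token", {}).get("text", "")


def stream_generate(endpoint_name: str, prompt: str) -> Iterator[str]:
    """Invoke a streaming endpoint and yield token texts as they arrive."""
    import boto3  # imported here so the parsing helpers work without AWS access

    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        ContentType="application/json",
        # Assumed TGI-style request body; adjust for the deployed container.
        Body=json.dumps(
            {"inputs": prompt, "parameters": {"max_new_tokens": 128}, "stream": True}
        ),
    )
    for line in iter_lines(response["Body"]):
        yield token_text(line)
```

A caller would print tokens as they arrive, for example `for text in stream_generate("my-endpoint", "Hello"): print(text, end="")`, where `my-endpoint` is a placeholder for a deployed endpoint name.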
The following notebooks cover deploying Llama 2 models on SageMaker and running inference against them with response streaming.
DLC | Model ID | Deploy Notebook | Inference Notebook |
---|---|---|---|
HF TGI | meta-llama/Llama-2-7b-chat-hf | Deploy | Inference |
HF TGI | meta-llama/Llama-2-13b-chat-hf | Deploy | Inference |
HF TGI | meta-llama/Llama-2-70b-chat-hf | Deploy | Inference |
LMI | meta-llama/Llama-2-7b-chat-hf | Deploy | Inference |
LMI | meta-llama/Llama-2-13b-chat-hf | Deploy | Inference |
LMI | meta-llama/Llama-2-70b-chat-hf | Deploy | Inference |
📖 Inference Llama 2 models with real-time response streaming using Amazon SageMaker
- Sagemaker Real-time Inference now supports response streaming
- Announcing the launch of new Hugging Face LLM Inference containers on Amazon SageMaker
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.