Amazon SageMaker Llama 2 Inference via Response Streaming

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. The fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama 2 outperforms other open source language models on many external benchmarks, including reasoning, coding, proficiency, and knowledge tests.

This repo helps customers achieve faster response times, measured as time to first byte (TTFB), and thus reduce the overall perceived latency. Streaming support is possible thanks to the recent announcement that SageMaker Real-time Inference now supports response streaming.
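To illustrate, here is a minimal invocation sketch using boto3's `invoke_endpoint_with_response_stream`. The endpoint name is hypothetical, and the request body assumes the TGI payload shape with a top-level `stream` flag; see the inference notebooks for the exact payloads each container expects.

```python
import json

import boto3

# Hypothetical endpoint name; use the one created by the deploy notebooks.
ENDPOINT_NAME = "llama-2-7b-chat-streaming"

smr = boto3.client("sagemaker-runtime")

body = {
    "inputs": "What is response streaming?",
    "parameters": {"max_new_tokens": 256},
    "stream": True,  # asks the TGI container to stream tokens back
}

# Unlike invoke_endpoint, this call returns as soon as the first bytes are
# ready, which is what improves time to first byte (TTFB).
response = smr.invoke_endpoint_with_response_stream(
    EndpointName=ENDPOINT_NAME,
    Body=json.dumps(body),
    ContentType="application/json",
)

# The response Body is an event stream of PayloadPart chunks.
for event in response["Body"]:
    part = event.get("PayloadPart")
    if part:
        print(part["Bytes"].decode("utf-8"), end="", flush=True)
```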

The samples cover notebook recipes for implementing response-streaming SageMaker endpoints for Llama 2 LLMs. The models are deployed using two Amazon SageMaker Deep Learning Containers (DLCs): the DLC for Large Model Inference (LMI) and the recently announced Hugging Face DLC powered by Text Generation Inference (HF TGI).

This repo covers deploying Llama 2 models on SageMaker and running inference against them via response streaming.
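As a taste of what the deploy notebooks do, here is a sketch of standing up the 7B chat model behind the HF TGI DLC with the SageMaker Python SDK. The container version, instance type, and environment values are assumptions; the notebooks pin the exact ones.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Resolve the Hugging Face TGI DLC image URI (the version here is an assumption).
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.0.3")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-2-7b-chat-hf",
        "SM_NUM_GPUS": "1",  # tensor-parallel degree for TGI
        "HUGGING_FACE_HUB_TOKEN": "<hf-token>",  # Llama 2 is a gated model on the Hub
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # assumption; larger chat variants need bigger instances
    container_startup_health_check_timeout=600,  # allow time to download weights
)
```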

Llama-2 Streaming Response

| DLC    | Model ID                       | Deploy Notebook | Inference Notebook |
|--------|--------------------------------|-----------------|--------------------|
| HF TGI | meta-llama/Llama-2-7b-chat-hf  | Deploy          | Inference          |
| HF TGI | meta-llama/Llama-2-13b-chat-hf | Deploy          | Inference          |
| HF TGI | meta-llama/Llama-2-70b-chat-hf | Deploy          | Inference          |
| LMI    | meta-llama/Llama-2-7b-chat-hf  | Deploy          | Inference          |
| LMI    | meta-llama/Llama-2-13b-chat-hf | Deploy          | Inference          |
| LMI    | meta-llama/Llama-2-70b-chat-hf | Deploy          | Inference          |
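Because the `PayloadPart` chunks in the event stream are not guaranteed to align with message boundaries, the notebooks reassemble them before parsing. Below is a minimal sketch of such a buffering helper, assuming the container emits newline-delimited messages (both TGI and LMI do); the class name and exact behavior are illustrative, not the repo's actual utility.

```python
import io


class LineIterator:
    """Buffers PayloadPart bytes from a SageMaker response stream and yields
    complete newline-delimited messages, one at a time."""

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1:] == b"\n":
                self.read_pos += len(line)
                return line[:-1]
            # Need more data: pull the next event from the stream.
            # StopIteration from the underlying iterator ends this one too.
            chunk = next(self.byte_iterator)
            if "PayloadPart" in chunk:
                self.buffer.seek(0, io.SEEK_END)
                self.buffer.write(chunk["PayloadPart"]["Bytes"])


# Usage with the streaming response from the earlier sketch:
#   for line in LineIterator(response["Body"]):
#       ...  # TGI emits server-sent-event style lines prefixed with b"data:"
```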

Blog

📖 Inference Llama 2 models with real-time response streaming using Amazon SageMaker

References

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.