This repository contains a Flask-based API that uses Microsoft's Phi-2 model, loaded through Hugging Face's Transformers library, to generate text from prompts sent via HTTP requests. It is designed to showcase the capabilities of large language models in processing natural language queries and generating coherent, contextually relevant text responses.
- Docker: The application is containerized with Docker, ensuring easy setup and compatibility across different environments.
- RAM: Due to the size and computational requirements of the Phi-2 model, a system with at least 16 GB of RAM is recommended to run the application.
This application is designed to run on the CPU. It neither requires nor uses GPU resources, which makes it suitable for a wide range of hardware setups.
Before running the application, be aware that the initial setup involves downloading model data from Hugging Face's model repository. The Phi-2 model and its dependencies require approximately 6 GB of storage. Ensure you have a stable internet connection and sufficient disk space for the download and subsequent data storage.
- Clone the Repository
Start by cloning this repository to your local machine:
git clone https://github.com/zakariaf/phi2-text-generator-api.git
cd phi2-text-generator-api
- Build the Docker Image
With Docker installed and running, build the Docker image using:
docker build -t phi2-text-generator .
- Run the Docker Container
After building the image, run the container with:
docker run -it -p 9001:9001 -v phi2-models:/model_cache phi2-text-generator
This command mounts the named volume phi2-models at /model_cache to cache the model data, so subsequent runs do not need to download it again.
Once the application is running, you can generate text by sending a POST request to the /generate endpoint with a JSON payload containing your prompt. For example, using curl:
curl -X POST http://127.0.0.1:9001/generate -H "Content-Type: application/json" -d "{\"prompt\":\"your prompt here\"}"
Replace "your prompt here" with the text you want the model to respond to.
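The same request can also be sent from Python. The sketch below uses only the standard library and assumes the container is reachable at 127.0.0.1:9001, as in the docker run command above. The helper names build_generate_request and generate are illustrative and not part of this repository, and the long default timeout is an assumption to allow for slow CPU inference.

```python
import json
import urllib.request

# Host and port taken from the docker run command above.
API_URL = "http://127.0.0.1:9001/generate"


def build_generate_request(prompt, url=API_URL):
    """Build the same POST request that the curl command sends."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, url=API_URL, timeout=300):
    """Send the prompt and return the decoded JSON response as a dict.

    The timeout value is an assumption, not a documented setting.
    """
    request = build_generate_request(prompt, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, generate("your prompt here")["generated_text"] should return the model's output as a string.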
To see the Phi-2 Text Generator API in action, you can use the following curl command to send a prompt about international football:
curl -X POST http://127.0.0.1:9001/generate -H "Content-Type: application/json" -d "{\"prompt\":\"In international football, which country is considered the strongest based on FIFA World Cup victories?\"}"
This request asks the model to identify the country considered the strongest in international football based on FIFA World Cup victories. Here's an example response you might receive:
{
"generated_text": "In international football, which country is considered the strongest based on FIFA World Cup victories?,\nAnswer: Brazil.\n\nExercise: In which year did Brazil win the FIFA World Cup for the fifth time?\nAnswer: 1958.\n\nExercise: How many times has Brazil won the FIFA World Cup?\nAnswer:"
}
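The model answers from its training data and then keeps generating, which is why the sample continues with follow-up exercises and breaks off mid-answer. Treat generated facts with care: the sample dates Brazil's fifth title to 1958, but it was won in 2002. Note that the output may also vary between runs due to the probabilistic nature of the model's text generation process.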
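The sample response above begins by repeating the prompt before the newly generated text. If a client only needs the continuation, the echoed prompt can be removed on the client side. The helper below is a minimal sketch of that idea; strip_prompt is a hypothetical name, not something the API provides.

```python
def strip_prompt(prompt, generated_text):
    """Return only the continuation when generated_text starts with prompt."""
    if generated_text.startswith(prompt):
        # Drop the echoed prompt and any whitespace that follows it.
        return generated_text[len(prompt):].lstrip()
    # The prompt was not echoed verbatim, so leave the text untouched.
    return generated_text


print(strip_prompt("Hi?", "Hi?\nAnswer: yes"))  # Answer: yes
```

If the model does not echo the prompt verbatim, the text is returned unchanged, so the helper is safe to apply to every response.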
This section walks through the main components and NLP concepts behind the Phi-2 Text Generator API. It explains what the code does, from the Flask setup to text generation with Hugging Face's Transformers library.
- Flask App Initialization: Initializes the application as a Flask web service, which facilitates the handling of HTTP requests. This makes our API capable of receiving and responding to user queries with ease.
- Route Definition: Establishes the /generate endpoint for POST requests where users can submit text prompts for the model to generate responses, showcasing Flask's utility in creating web services.
- Tokenizer and Model Loading: Utilizes AutoTokenizer and AutoModelForCausalLM for processing text inputs and generating responses. This demonstrates how the Transformers library simplifies working with complex NLP models.
- Attention Mask: In Transformer models, the self-attention mechanism allows tokens to interact with each other. The attention mask is a binary tensor that indicates to the model which tokens should be focused on (1) and which are padding and should be ignored (0). This ensures that the model concentrates on the actual data in inputs of varied lengths, enhancing the relevance of the generated text.
- Pad Token ID: Uniform input processing requires padding shorter sequences to match the longest sequence's length in a batch. The Pad Token ID specifies the token used for padding, ensuring the tokenizer and model treat padding consistently. Proper padding is crucial for the model to accurately interpret the input, preventing it from considering padding as meaningful content.
- Extracting Prompts: Demonstrates how to extract prompts from POST request payloads, which the model uses as the basis for generating text responses.
- Model Generation Call: Details the process of invoking the model's generate method with tokenized inputs and an attention mask. This includes setting generation parameters (e.g., max_length, temperature, do_sample) that influence the creativity and length of the output.
- Response Formatting: The generated tokens are decoded to text and returned as a JSON response, illustrating the end-to-end process of receiving a prompt, generating text, and sending a response.
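The components listed above can be sketched together as follows. This is a hedged illustration, not the repository's phi2_demo.py. The model ID microsoft/phi-2, the generation values, the reuse of the EOS token as the pad token, the 400 error response, and the names load_phi2_generator and create_app are all assumptions made for the example. The model is loaded in a separate function so that the Flask wiring can be exercised without downloading it.

```python
from flask import Flask, jsonify, request


def load_phi2_generator(model_name="microsoft/phi-2", max_length=200):
    """Load tokenizer and model once; return a prompt -> text function.

    transformers is imported here so the Flask wiring below can be
    tested without the model. All generation values are illustrative.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def generate_text(prompt):
        # The tokenizer returns both input_ids and attention_mask tensors.
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            pad_token_id=tokenizer.eos_token_id,  # assumption: EOS as pad
            max_length=max_length,
            do_sample=True,
            temperature=0.7,
        )
        # Decode the first (only) sequence back into text.
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return generate_text


def create_app(generate_text):
    """Wire a prompt -> text function to a POST /generate endpoint."""
    app = Flask(__name__)

    @app.route("/generate", methods=["POST"])
    def generate():
        payload = request.get_json(silent=True)
        prompt = payload.get("prompt") if isinstance(payload, dict) else None
        if not prompt:
            # Assumed error format; the real API may respond differently.
            return jsonify({"error": "missing 'prompt'"}), 400
        return jsonify({"generated_text": generate_text(prompt)})

    return app
```

Under these assumptions, create_app(load_phi2_generator()).run(host="0.0.0.0", port=9001) would serve the endpoint on the port used in the docker run command. Passing the generator function into create_app also makes it possible to substitute a stub when testing the route.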
Passing the Attention Mask and Pad Token ID explicitly ensures that the Phi-2 model processes inputs correctly: padding is ignored rather than treated as content, which helps keep the generated text coherent and relevant to the prompt.
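To make the relationship between padding and the attention mask concrete, here is a small pure-Python sketch. In practice the tokenizer does this work. The token IDs and the pad ID 0 are made up for illustration, pad_batch is a hypothetical helper, and right padding is chosen only for readability, since the padding side is a tokenizer setting.

```python
def pad_batch(sequences, pad_token_id):
    """Right-pad token ID lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; the remainder is filled with the pad ID.
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[11, 12, 13], [21]], pad_token_id=0)
print(ids)   # [[11, 12, 13], [21, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Each row of the mask has the same length as its row of input IDs, which is what lets the model tell content from padding in a batch of mixed-length prompts.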
The API's behavior can be customized by modifying the phi2_demo.py script, including changing the port, adjusting model generation parameters, or altering the response format.
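When adjusting generation parameters, it helps to see what temperature does. During sampling, the model's scores (logits) for the next token are divided by the temperature before being turned into probabilities, so lower values sharpen the distribution and higher values flatten it. In the Transformers library, temperature only takes effect when do_sample is enabled. The sketch below illustrates the scaling in pure Python; the logits are made-up numbers and softmax_with_temperature is a hypothetical helper, not part of the API.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's probability shrinking as the temperature rises, which is why higher temperatures tend to produce more varied output.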
Contributions to improve the application or extend its capabilities are welcome. Please feel free to fork the repository, make your changes, and submit a pull request.