This repository contains a complete implementation of Retrieval Augmented Generation (RAG) using Mistral-7B-Instruct-v0.1 for generating responses from a custom dataset. The main file `app.py` sets up a Flask web server to provide an interface for querying the RAG model.
This implementation is done in two key phases: indexing and retrieval & generation. First, during indexing, documents are split into text chunks, and their embeddings are stored in a vector database. Then, in the retrieval & generation phase, user queries are matched against these embeddings, prompting the LLM to generate responses based on the retrieved contexts.
Why Mistral-7B? Mistral-7B, especially in its 4-bit quantized version, offers impressive performance while remaining memory-efficient (other open-source LLMs such as Llama 2, Mindy-7B, MoMo-70B, etc. can also be used). Here, `Mistral-7B-Instruct-v0.2-Q4_K_M.gguf` is used for the retrieval and generation tasks.
The stack includes the LlamaIndex framework, which provides `SentenceWindowNodeParser`, `VectorStoreIndex`, `ServiceContext`, and `SentenceTransformerRerank` for powerful, effortless querying and access to domain-specific data, outperforming alternatives like LangChain.
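For reference, the sketch below shows how these components are typically wired together with the legacy `llama_index` API; the window size and reranker model shown here are assumptions for illustration, not necessarily the repository's exact configuration.

```python
# Illustrative sketch of the sentence-window retrieval components named above
# (legacy llama_index API). Window size and reranker model are assumed values.
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import (
    MetadataReplacementPostProcessor,
    SentenceTransformerRerank,
)

# Split documents into single-sentence nodes, each carrying a window of
# surrounding sentences in its metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# At query time, replace each retrieved sentence with its full window, then
# rerank the candidates with a cross-encoder before they reach the LLM.
postprocessors = [
    MetadataReplacementPostProcessor(target_metadata_key="window"),
    SentenceTransformerRerank(top_n=2, model="cross-encoder/ms-marco-MiniLM-L-2-v2"),
]
```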
Instead of fine-tuning an LLM, the embedding-based retrieval approach ensures scalability and avoids issues like model drift, cost, and complexity.
How to assess the RAG system? Check out the evaluation process and benchmarks in detail here. It offers further insights into the methodology and performance metrics.
(Note: A system with at least 12 GB of GPU memory and 16 GB of RAM is recommended for optimal performance.)
Set up the environment and install the necessary dependencies using the following steps:

- Clone this repository to your local machine:

  ```bash
  git clone https://github.com/ChanukaRavishan/MistralRAG-LlamaIndex.git
  ```

- Navigate to the repository directory:

  ```bash
  cd MistralRAG-LlamaIndex
  ```

- Install the required Python packages. You may use a virtual environment to manage dependencies:

  ```bash
  pip install -r requirements.txt
  ```
To create the vector storage, follow the instructions in `LLaMaCPP_python_creating_vector_storage.ipynb`. Here, I'm using the Hugging Face `bge-small-en-v1.5` model for embedding generation and storing the embeddings in the `VectorStoreIndex` provided by LlamaIndex.
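If you just need the gist of that notebook, the following is a minimal sketch of building and persisting the index with the legacy `llama_index` API; the data and output directories below are placeholders, not the notebook's exact paths.

```python
# Minimal sketch of building and persisting the vector store (not the notebook itself).
# "./data" and "./vector_store" are placeholder paths.
from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding

documents = SimpleDirectoryReader("./data").load_data()

# bge-small-en-v1.5 embeddings from Hugging Face; no LLM is needed at indexing time.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
index.storage_context.persist(persist_dir="./vector_store")
```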
To start the Flask web server and interact with the RAG model, run the following command:

```bash
nohup python app.py &
```

This will start the server locally at http://localhost:3000/. You can now visit this URL in your web browser to access the interface, or, if you are running this implementation on a remote machine, visit http://<your-remote-ip-address>:3000/.
- `GET /query`: This endpoint accepts a query string as a parameter (`message`) and returns the response generated by the RAG model. Example usage: `http://localhost:3000/query?message=your_query_here`
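For example, the endpoint can be called from Python as shown below (this assumes the server is running locally on port 3000; the query text is just an example).

```python
# Simple client call to the /query endpoint; the message text is an example.
import requests

response = requests.get(
    "http://localhost:3000/query",
    params={"message": "What does the dataset say about X?"},
)
print(response.text)
```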
The `app.py` file contains the main implementation for setting up the Flask web server and integrating the RAG model. Here's a breakdown of its components:

- Initialization Functions:
  - `initialize_llm`: Initializes the LlamaCPP model with specified parameters.
  - `initialize_query_engine`: Initializes the query engine for executing queries with the RAG model.

- Flask App Routes:
  - `/`: Renders the `index.html` template in the templates folder.
  - `/query`: Accepts a query string and returns the response generated by the RAG model.
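The sketch below illustrates how these pieces could fit together. It is not the actual `app.py`: the model path, generation parameters, and persist directory are placeholders, and the code assumes the legacy `llama_index` API.

```python
# Illustrative sketch of app.py's structure -- not the repository's exact code.
# Model path, generation parameters, and persist_dir are placeholders.
from flask import Flask, render_template, request
from llama_index import ServiceContext, StorageContext, load_index_from_storage
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import LlamaCPP


def initialize_llm():
    # Load the 4-bit quantized Mistral GGUF model through llama.cpp.
    return LlamaCPP(
        model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # placeholder path
        temperature=0.1,
        max_new_tokens=256,
        context_window=3900,
        model_kwargs={"n_gpu_layers": -1},  # offload layers to the GPU if available
    )


def initialize_query_engine(llm):
    # Reload the persisted vector store and build a query engine on top of it.
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
    service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
    storage_context = StorageContext.from_defaults(persist_dir="./vector_store")
    index = load_index_from_storage(storage_context, service_context=service_context)
    return index.as_query_engine(similarity_top_k=3)


app = Flask(__name__)
query_engine = initialize_query_engine(initialize_llm())


@app.route("/")
def home():
    # Renders templates/index.html.
    return render_template("index.html")


@app.route("/query")
def query():
    # Answer the user's question with the RAG pipeline.
    message = request.args.get("message", "")
    response = query_engine.query(message)
    return str(response)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=3000)
```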
- Ensure that you have the necessary models and resources available in the specified paths.
- Customize the configuration and parameters according to your requirements.
Feel free to explore and modify the code to suit your needs. If you encounter any issues or have suggestions for improvement, please don't hesitate to open an issue or contribute to the repository.
Thanks!