Optimization techniques #38

Merged
merged 8 commits into from
Jul 29, 2024
68 changes: 68 additions & 0 deletions README.md
@@ -42,6 +42,74 @@ echo "OPENAI_API_KEY=<your_openai_api_key>" > .env
poetry install
```

## Approach
I followed these steps to develop the RAG system and then optimize it.

1. Project setup
2. Data preparation and loading
3. RAG system setup
4. Evaluation pipeline setup using RAGAS
5. Run and analyze the baseline benchmark evaluation
6. Identify areas of improvement
7. Identify optimization techniques
8. Implement optimization techniques

### Project setup
I created a new project using Poetry and added the necessary dependencies, namely the LangChain tools and RAGAS.

### Data preparation and loading

I used the CNN/Daily Mail dataset for this project. The dataset is available on the Hugging Face datasets library. I loaded the dataset using the `datasets` library and extracted the necessary fields for the RAG system.

```
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation[:1000]")
```

The call above loads the first 1000 examples from the validation split of the CNN/Daily Mail dataset.
The function that does this can be found in `src/rag_pipeline/load_docs.py`.
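
As a minimal sketch of the field extraction, the snippet below turns dataset rows into documents that keep only the `article` text as page content. The `Document` dataclass is a stand-in for LangChain's document type and `rows_to_documents` is a hypothetical name; the actual code in `src/rag_pipeline/load_docs.py` may look different.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain document."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def rows_to_documents(
    rows: Iterable[dict[str, Any]], content_column: str = "article"
) -> list[Document]:
    """Turn dataset rows into documents, keeping the other fields as metadata."""
    documents = []
    for row in rows:
        text = row.get(content_column)
        if not text:
            continue  # skip rows without usable text
        metadata = {k: v for k, v in row.items() if k != content_column}
        documents.append(Document(page_content=text, metadata=metadata))
    return documents


# Rows shaped like CNN/Daily Mail records (id, article, highlights); values are made up.
sample_rows = [
    {"id": "a1", "article": "First article text.", "highlights": "First summary."},
    {"id": "a2", "article": "", "highlights": "Empty article is skipped."},
]
docs = rows_to_documents(sample_rows)
```

Keeping the remaining fields as metadata is a design choice made for this sketch; it makes it easy to trace a retrieved chunk back to its source article.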

### RAG system setup
#### Basic RAG system
Having some experience with using ChromaDB vectorstore, I decided to use it for the initial setup of the RAG system.

I set up the basic RAG system as follows:
1. Load documents: I loaded the dataset from a CSV file and kept only the `article` column as the `page_content` of each document.
2. Split documents: Using LangChain's `RecursiveCharacterTextSplitter`, I split the documents into small chunks.
3. Create vectorstore: I used `langchain_chroma` to create a vectorstore from the split documents.
4. Set up the LLM: I used OpenAI's gpt-3.5-turbo to test the setup, planning to upgrade to gpt-4o once it was ready.
5. Create the RAG chain: I built a simple chain with LangChain's [`RetrievalQA`](https://docs.smith.langchain.com/old/cookbook/hub-examples/retrieval-qa-chain) that retrieves documents and generates answers.
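
To illustrate the splitting step, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm LangChain's `RecursiveCharacterTextSplitter` uses, which also tries to break on separators such as paragraphs and sentences; the function name and the sizes are assumptions chosen for illustration.

```python
def split_with_overlap(
    text: str, chunk_size: int = 500, chunk_overlap: int = 100
) -> list[str]:
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters so that a sentence
    falling on a boundary still appears whole in at least one chunk.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# A 1200-character text with the illustrative sizes above.
chunks = split_with_overlap("abcdefghij" * 120, chunk_size=500, chunk_overlap=100)
```

Smaller chunks with some overlap generally produce more chunks per document, which is the trade-off to keep in mind when tuning these two values.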

#### Advancing the RAG system with best practices
I followed these steps to make the RAG system reusable and scalable:
1. Created a `RAGSystem` class that sets up the RAG system. The class can be found in `src/rag_pipeline/rag_system.py`.
2. Added methods to load documents, split documents, create the vectorstore, set up the LLM, create the RAG chain, and more.
3. Usage: the class can be imported and initialized as follows:
```
from src.rag_pipeline.rag_system import RAGSystem

rag_system = RAGSystem(
    model_name="gpt-4o",
    embeddings=embeddings,
    # Here you can add more parameters to customize the RAG system
)

rag_system.initialize()
```
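
The class itself is not reproduced here. As a rough, hypothetical outline of the shape such a class could take, the sketch below shows `initialize()` running the setup steps in order. The class name `RAGSystemSketch`, the method names, and the stubbed step bodies are assumptions, not the code in `src/rag_pipeline/rag_system.py`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RAGSystemSketch:
    """Hypothetical outline of a reusable RAG setup class."""

    model_name: str = "gpt-4o"
    embeddings: Any = None
    steps_run: list[str] = field(default_factory=list)

    # Each step is a stub that only records that it ran.
    def load_documents(self) -> None:
        self.steps_run.append("load_documents")

    def split_documents(self) -> None:
        self.steps_run.append("split_documents")

    def create_vectorstore(self) -> None:
        self.steps_run.append("create_vectorstore")

    def setup_llm(self) -> None:
        self.steps_run.append("setup_llm")

    def create_rag_chain(self) -> None:
        self.steps_run.append("create_rag_chain")

    def initialize(self) -> None:
        """Run every setup step in the order the pipeline needs them."""
        for step in (
            self.load_documents,
            self.split_documents,
            self.create_vectorstore,
            self.setup_llm,
            self.create_rag_chain,
        ):
            step()


sketch = RAGSystemSketch(model_name="gpt-4o")
sketch.initialize()
```

Keeping each step in its own method means a single stage, such as the vectorstore, can be swapped without touching the rest.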

#### Integrating pgvector as the vector database
I decided to integrate the pgvector vectorstore for improved performance.
I followed the steps below to integrate pgvector:
1. Set up the pgvector database:
- Installed the necessary dependencies with Poetry, including `langchain-pgvector` and `pgvector`.
- Ran the pgvector database, which is built on PostgreSQL, in Docker.
- Created a Docker Compose file for the database. The file, `docker-compose.yml`, contains the pgvector service and the database service.
- Wrote a script, `scripts/init.sql`, that creates the `vector` extension and the embeddings table. However, when using langchain-pgvector the script is not necessary, as the library creates the table and extension itself.
- Started the database with the command `docker compose up -d`.
- Added a make target for this command; it is defined in the `Makefile` as `up`, alongside other commands. The `Makefile` lets me easily document and run the commands critical to the project.

2. Add pgvector vectorstore to the RAG system
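
Connecting to the containerized database usually comes down to a SQLAlchemy-style connection URL. The sketch below builds one; the function name, credentials, port, database name, and the `psycopg` driver are placeholder assumptions and should match whatever `docker-compose.yml` defines.

```python
from urllib.parse import quote


def build_pg_connection_url(
    user: str,
    password: str,
    host: str = "localhost",
    port: int = 5432,
    database: str = "vectordb",
    driver: str = "psycopg",
) -> str:
    """Build a PostgreSQL connection URL, percent-encoding the credentials."""
    # safe="" also encodes characters such as "@" and "/" that would
    # otherwise be misread as URL delimiters.
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    return f"postgresql+{driver}://{user_enc}:{password_enc}@{host}:{port}/{database}"


# Placeholder credentials; real values belong in the .env file.
connection_url = build_pg_connection_url("postgres", "p@ss word")
```

Encoding the credentials matters because a password containing `@` or `/` would otherwise produce a URL that parses incorrectly.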




## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
41 changes: 25 additions & 16 deletions misc/settings.py
@@ -10,26 +10,35 @@ class Settings:
 PAGE_CONTENT_COLUMN: Final = "article"

 GENERATOR_TEMPLATE: Final = """
-Use the following pieces of context to answer the question at the end.
-These are the instruction to consider:
-- Prioritize accuracy and conciseness in your response.
-- Answer directly and avoid repeating information from the question.
-- If the context doesn't contain the answer, just say that "I don't know".
-- Don't try to make up an answer.
-- Limit your answer to three sentences maximum, but aim for two if possible.
+Use the following pieces of context to answer the question at the end.
+If you don't know the answer, just say that you don't know, don't try to make up an answer.
+Use three sentences maximum and keep the answer as concise as possible.
+Context: {context}
+Question: {question}
+Helpful Answer:
+"""
+
+# GENERATOR_TEMPLATE: Final = """
+# Use the following pieces of context to answer the question at the end.
+# These are the instruction to consider:
+# - Prioritize accuracy and conciseness in your response.
+# - Answer directly and avoid repeating information from the question.
+# - If the context doesn't contain the answer, just say that "I don't know".
+# - Don't try to make up an answer.
+# - Limit your answer to three sentences maximum, but aim for two if possible.

-Example:
-Context: The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower.
+# Example:
+# Context: The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower.

-Question: Where is the Eiffel Tower located?
-Answer: Paris, France
+# Question: Where is the Eiffel Tower located?
+# Answer: Paris, France

-REMEMBER TO FOLLOW THE INSTRUCTIONS ABOVE.
+# REMEMBER TO FOLLOW THE INSTRUCTIONS ABOVE.

-Context: {context}
-Question: {question}
-Answer:
-"""
+# Context: {context}
+# Question: {question}
+# Answer:
+# """

 EVALUATION_FILE_PATH = "data/evaluation_sets/evaluation_set_20d20.csv"
 EVALUAION_DATASET_NAME: Final = "CNN DailyMail Evaluation Dataset"
78 changes: 39 additions & 39 deletions notebooks/optimization_techniques/2_ensemble_retrievers.ipynb
@@ -23,7 +23,7 @@
 "from langchain.vectorstores import Chroma\n",
 "from langchain.chains import RetrievalQA\n",
 "from langchain_community.document_loaders import HuggingFaceDatasetLoader\n",
-"from langchain.embeddings import HuggingFaceEmbeddings\n",
+"from langchain_huggingface import HuggingFaceEmbeddings\n",
 "from dotenv import load_dotenv"
 ]
 },
@@ -43,7 +43,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Load API keys"
+"#### Load API keys"
 ]
 },
 {
@@ -59,7 +59,14 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"#### Initialize embeddings and RAG system"
+"### Rag Setup"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"#### Initialize embeddings from huggingface - [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)"
 ]
 },
 {
@@ -68,7 +75,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# embeddings=HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')\n",
+"# embeddings=HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')\n",
 "embeddings = OpenAIEmbeddings(api_key=openai_api_key, model='text-embedding-ada-002')"
 ]
 },
@@ -79,16 +86,6 @@
 "#### Initialize RAG system with ensemble_retriever with BM25 retriever"
 ]
 },
-{
-"cell_type": "code",
-"execution_count": 10,
-"metadata": {},
-"outputs": [],
-"source": [
-"optimization_name = \"ensemble_retriever_with_bm25\"\n",
-"optimization_no = 2"
-]
-},
 {
 "cell_type": "code",
 "execution_count": 5,
@@ -99,8 +96,9 @@
 "rag_system_ensemble = RAGSystem(\n",
 " model_name = \"gpt-4o\",\n",
 " existing_vectorstore = False,\n",
-" use_ensemble_retriever = True,\n",
-" embeddings=embeddings\n",
+" # use_ensemble_retriever = True,\n",
+" embeddings=embeddings,\n",
+" clear_store=True\n",
 ")"
 ]
 },
@@ -113,7 +111,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"--Split 1000 documents into 5030 chunks.--\n"
+"--Split 1000 documents into 9817 chunks.--\n"
 ]
 }
 ],
@@ -131,28 +129,28 @@
 },
 {
 "cell_type": "code",
-"execution_count": 8,
+"execution_count": 7,
 "metadata": {},
 "outputs": [
 {
 "data": {
 "text/plain": [
-"{'question': 'What event is Rory McIlroy preparing for after the WGC-Cadillac Championship?',\n",
-" 'answer': 'Rory McIlroy is preparing for the U.S. Masters at Augusta after the WGC-Cadillac Championship.',\n",
-" 'contexts': ['(CNN)Jordan Spieth has Rory McIlroy and the world No.1 spot firmly in his sights after winning the Valspar Championship on Sunday. Spieth won a three-way play-off with a 28-foot birdie on the third extra hole to become only the fourth player since 1940 to win twice on the PGA Tour before turning 22. It is a feat that not even McIlroy mastered with Tiger Woods, Sergio Garcia and Robert Gamez the only players to have achieved that particular accolade in the past 75 years. But it is the Northern Irishman that is within Spieth\\'s focus heading towards Augusta. \"I like studying the game, being a historian of the game,\" Spieth told the PGA Tour website. \"It\\'s really cool to have my name go alongside those. \"But right now currently what I\\'m really focused on is Rory McIlroy and the No.1 in the world. That\\'s who everyone is trying to chase. \"That\\'s our ultimate goal to eventually be the best in the world and this is a great, great stepping stone. But going into the four majors of the year, to',\n",
-" 'was good fun.\" Not that opportunity knocked for McIlroy when he chose the 3-iron to play his third shot to the 18th and final hole of the tournament and promptly found the water again. The Northern Irishman feigned to repeat his earlier antics, before placing it back in the bag. His mistake led to a double bogey six and left him tied for ninth at one-under-par, eight shots behind winner Dustin Johnson. McIlroy had promised to return the club to Trump after the round and was as good as his word. \"We\\'re thinking about auctioning it for charity or doing a trophy case for Doral, putting it on a beautiful mount,\" Trump said. Johnson is looking set to be one of McIlroy\\'s main rivals in the first major of the season, the U.S. Masters at Augusta, next month and his victory completed a triumphant comeback to the PGA Tour. The 30-year-old American took a six-month break from the Tour last July to cope with \"personal problems\" and returned earlier this year. Johnson finished with a three-under',\n",
-" 'It raises money for two Florida hospitals named for the seven-time major winner and his late wife Winnie. \"I am so proud of what has been accomplished at the hospitals over the past 25 years. It is always a privilege to know that we are making a difference in the lives of families throughout the community,\" said Palmer after his medical center was named one of the best for children in the U.S. for 2014-15. He hurt his shoulder in December after tripping on carpet when he was about to make a speech at a PGA Tour father/son event. World No. 1 Rory McIlroy will make his first appearance at Palmer\\'s March 19-22 tournament, which features a restricted field, while top-five players Bubba Watson, Henrik Stenson, Adam Scott and Jason Day will also take part. Like us on Facebook .',\n",
-" '(CNN)With a little bit of help from Donald Trump, Rory McIlroy was re-united with the golf club he famously threw into the lake at Doral -- but probably wished the golf-loving tycoon had not bothered. Never one to miss a media opportunity, Trump, the owner of the Blue Monster course in Florida, got a scuba diver to retrieve the 3-iron club which world No. 1 McIlroy had thought he had seen the last of during Friday\\'s second round at the WGC-Cadillac Championship. The 68-year-old American entrepreneur presented it to McIlroy before his final round Sunday, telling him that it was unlucky to continue playing with 13 clubs as against the usual 14 allowed under golf\\'s rules. \"He\\'s never one to miss an opportunity,\" McIlroy told the official PGA Tour website after his round. \"It was fine. It was good fun.\" Not that opportunity knocked for McIlroy when he chose the 3-iron to play his third shot to the 18th and final hole of the tournament and promptly found the water again. The Northern',\n",
-" '(CNN)It was an act of frustration perhaps more commonly associated with golf\\'s fictional anti-hero Happy Gilmore than the world\\'s reigning No 1. player. But when Rory McIlroy pulled his second shot on the eighth hole of the WGC Cadillac Championship into a lake Friday, he might as well have been channeling the much loved Adam Sandler character. Before continuing his round with a dropped ball, the four-time major winner launched the 3-iron used to play the offending shot into the water as well. \"(It) felt good at the time,\" a rueful McIlroy later said of the incident in comments carried by the PGA Tour website. \"I just let frustration get the better of me. It was heat of the moment, and I mean, if it had of been any other club I probably wouldn\\'t have but I didn\\'t need a 3‑iron for the rest of the round so I thought, why not.\" The club \"must have went a good 60, 70 yards,\" he joked. McIlroy composed himself to finish with a second round of 70, leaving him one-under for the tournament']}"
+"{'question': \"Who was one of Putin's harshest critics?\",\n",
+" 'answer': \"One of Putin's harshest critics was Mikhail Khodorkovsky.\",\n",
+" 'contexts': ['be heading an opposition party and do what I\\'m doing.\" Opinion: The complicated life and tragic death of Boris Nemtsov . Critics of Putin have in the past suffered miserable fates. Last year, a Moscow court sentenced five men to prison for the 2006 killing of Russian journalist and fierce Kremlin critic Anna Politkovskaya. Business magnate Mikhail Khodorkovsky accused Putin of corruption and spent 10 years in prison and labor camps. Late last year, Kremlin critic Alexey Navalny was found guilty',\n",
+" 'Other critics of Putin who ended up dead . Putin has a history of viciously attacking the most important person in any given group of enemies, in order to send a message to the rest of them. In 2003, he did this by arresting and imprisoning the richest oligarch in the country, Mikhail Khodorkovsky. When Khodorkovsky was put on trial in 2004, Putin allowed the television cameras film the wealthiest man in the country sitting in a cage. Imagine that you were the 17th richest man in Russia, and',\n",
+" \"look at some cases of outspoken critics of Putin's government who've ended up in exile, under house arrest, behind bars or dead. The business magnate backed an opposition party and accused Putin of corruption. He spent more than 10 years behind bars on charges of tax evasion and fraud. In statements to CNN, Khodorkovsky said his prosecution was part of a Kremlin campaign to destroy him and take control of Yukos, the oil company he built from privatization deals in the 1990s. The Kremlin denied\",\n",
+" \"a journalist critical of Russia's war in Chechnya. She was gunned down at the entrance to her Moscow apartment in 2006. There was also business magnate Mikhail Khodorkovsky, who backed an opposition party and accused Putin of corruption. Khodorkovsky landed in jail after a conviction on tax fraud, which he said was a ploy to take away his oil company. The government rejected the claim. Putin pardoned him in 2013. Former Russian security agent Alexander Litvinenko was poisoned by a lethal dose\",\n",
+" 'he said then. After Nemtsov was shot, Putin condemned the killing and ordered three law enforcement agencies to investigate the shooting, the Kremlin said in a statement. But critics of Putin have in the past suffered miserable fates. Last year, a Moscow court sentenced five men to prison for the 2006 killing of Russian journalist and fierce Kremlin critic Anna Politkovskaya. Business magnate Mikhail Khodorkovsky accused Putin of corruption and spent 10 years in prison and labor camps. Russian']}"
 ]
 },
-"execution_count": 8,
+"execution_count": 7,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
-"question = \"What event is Rory McIlroy preparing for after the WGC-Cadillac Championship?\"\n",
+"question = \"Who was one of Putin's harshest critics?\"\n",
 "result = rag_system_ensemble.rag_chain.invoke(question)\n",
 "result"
 ]
@@ -166,27 +164,27 @@
 },
 {
 "cell_type": "code",
-"execution_count": 9,
+"execution_count": 8,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
 "--LOADING EVALUATION DATA--\n",
-"--GETTING CONTEXT AND ANSWERS--\n",
-"--EVALUATING LOCALLY--\n"
+"--EVALUATING LOCALLY--\n",
+"--GETTING CONTEXT AND ANSWERS--\n"
 ]
 },
 {
 "data": {
 "application/vnd.jupyter.widget-view+json": {
-"model_id": "a6274de6f8804585badb7a081b4ac730",
+"model_id": "f5e981536dc049a48e3d79dbe75107ed",
 "version_major": 2,
 "version_minor": 0
 },
 "text/plain": [
-"Evaluating: 0%| | 0/76 [00:00<?, ?it/s]"
+"Evaluating: 0%| | 0/80 [00:00<?, ?it/s]"
 ]
 },
 "metadata": {},
@@ -196,23 +194,25 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"--EVALUATION COMPLETE--\n"
+"--EVALUATION COMPLETE--\n",
+"--RESULTS SAVED--\n"
 ]
 }
 ],
 "source": [
-"rag_results = run_ragas_evaluation(rag_system_ensemble.rag_chain)"
+"rag_results = run_ragas_evaluation(\n",
+" rag_chain=rag_system_ensemble.rag_chain,\n",
+" save_results=True,\n",
+" experiment_name=\"chunk_size_500_overlap_100\",\n",
+")\n"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 11,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
-"source": [
-"# Save results to csv\n",
-"rag_results.to_csv(f\"data/evaluation_results/bm_{optimization_no}_{optimization_name}.csv\")"
-]
+"source": []
 }
 ],
 "metadata": {