This document introduces the main configurations and functionalities of this project, using the Standard RAG process as a running example. Additional documentation will follow to cover advanced uses, such as reproducing existing methods and the detailed usage of individual components.
The Standard RAG process consists of the following three steps (a minimal code sketch follows the list):
- Retrieve relevant documents from the knowledge base based on the user's query.
- Incorporate the retrieved documents and the original query into a prompt.
- Input the prompt into the generator.
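Before diving into FlashRAG's own abstractions, here is a minimal sketch of these three steps in plain Python. The `retrieve` and `generate` callables and the prompt wording are hypothetical placeholders, not FlashRAG APIs; the `SequentialPipeline` used later performs this same loop internally.

```python
# A minimal sketch of Standard RAG, assuming hypothetical retrieve() and
# generate() callables (FlashRAG's pipeline handles all of this for you).
def standard_rag(query, retrieve, generate, topk=1):
    docs = retrieve(query, topk=topk)           # 1. fetch relevant documents
    context = "\n".join(docs)
    prompt = (                                  # 2. fold documents + query into a prompt
        "Answer the question based on the given documents.\n"
        f"Documents:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)                     # 3. feed the prompt to the generator
```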
This document will demonstrate the RAG process using E5 as the retriever and Llama2-7B-Chat as the generator.
To smoothly run the entire RAG process, you need to complete the following five preparations:
- Install the project and its dependencies.
- Download the required models.
- Download the necessary datasets (a toy dataset is provided).
- Download the document collection for retrieval (a toy corpus is provided).
- Build the index for retrieval (a toy index is provided).
To save time in getting started, we provide toy datasets, document collections, and corresponding indices. Therefore, you only need to complete the first two steps to successfully run the entire process.
Install the project and its dependencies using the following commands.
Note that if you encounter issues installing the `vllm`, `fschat`, or `pyserini` packages, you can comment them out in the `requirements.txt` file. These packages are required for certain functionalities, but omitting them temporarily won't affect the workflow described in this document.
```bash
git clone https://github.com/RUC-NLPIR/FlashRAG.git
cd FlashRAG
pip install -e .
```
You need to download the following two models:
- E5-base-v2
- Llama2-7B-Chat
You can download both models from Hugging Face. If you are in China, it is recommended to download them via the mirror platform hf-mirror.
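As one possible way to fetch them, the sketch below uses `snapshot_download` from `huggingface_hub`. The repository IDs are the standard ones on Hugging Face; note that Llama2-7B-Chat is gated, so you must first accept Meta's license on the model page. Any other method that puts the weights on local disk works equally well.

```python
# Optional download helper; to route through hf-mirror, set the endpoint
# BEFORE importing huggingface_hub, e.g. in your shell:
#   export HF_ENDPOINT=https://hf-mirror.com
from huggingface_hub import snapshot_download

retriever_path = snapshot_download('intfloat/e5-base-v2')
generator_path = snapshot_download('meta-llama/Llama-2-7b-chat-hf')  # gated: accept the license first
print(retriever_path, generator_path)  # use these paths in the config below
```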
The datasets include queries and corresponding standard answers, allowing us to evaluate the effectiveness of the RAG system.
For simplicity, we sampled 17 examples from NQ as a toy dataset, located at `examples/quick_start/dataset/nq`. The subsequent RAG process will be run on these questions.
Our repository also provides a large number of preprocessed benchmark datasets; you can visit our Hugging Face datasets page to download and use them.
The document collection contains a large number of segmented paragraphs, serving as the external knowledge source for the RAG system. Since commonly used document collections are often very large (~5 GB or more), we use a general-knowledge dataset as a toy collection, located at `examples/quick_start/indexes/general_knowledge.jsonl`.
Due to the small number of documents, many queries may not find relevant texts, which could affect the final retrieval results.
If you need the full document collection, you can visit our Hugging Face dataset to download it.
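The collection is a JSONL file with one document per line. The snippet below peeks at the first document of the toy corpus; the `id` and `contents` field names are what FlashRAG-style corpora conventionally use, so treat them as an assumption and check the actual keys in your file:

```python
import json

# Peek at the first document of the toy corpus (path as given in this guide).
# The 'id'/'contents' field names are assumed; print doc.keys() to confirm.
with open('indexes/general_knowledge.jsonl', encoding='utf-8') as f:
    doc = json.loads(f.readline())

print(doc.keys())
print(doc.get('contents', '')[:200])  # first 200 characters of the paragraph text
```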
To improve retrieval efficiency, we often need to build the retrieval index in advance. For the BM25 method, the index is usually an inverted index (a directory in our project). For various embedding methods, the index is a Faiss database containing the embeddings of all texts in the document collection (an .index file). Each index corresponds to a corpus and a retrieval method, meaning that every time you want to use a new embedding model, you need to rebuild the index.
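To make the embedding case concrete, here is a generic illustration of what such an index is. This is not FlashRAG's index builder (see the index-building document for that); the dimension matches e5-base-v2, but the random vectors are stand-ins for real embeddings.

```python
import faiss
import numpy as np

# Generic Faiss "Flat" index sketch: store one embedding per paragraph,
# then answer queries by exact nearest-neighbor search over inner products.
dim = 768                                             # e5-base-v2 embedding size
index = faiss.IndexFlatIP(dim)

doc_embeddings = np.random.rand(1000, dim).astype('float32')  # stand-in for real embeddings
index.add(doc_embeddings)                             # index every paragraph in the corpus

query_embedding = np.random.rand(1, dim).astype('float32')
scores, doc_ids = index.search(query_embedding, 5)    # top-5 most similar paragraphs

faiss.write_index(index, 'my_corpus.index')           # persisted as an .index file
```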
Here, we provide a toy index, built using E5-base-v2 and the aforementioned toy corpus.
If you want to use your own retrieval model and documents, you can refer to our index building document to build your index.
In the following steps, we will break down each step and demonstrate the corresponding code. The complete code will be provided at the end, or you can refer to the simple_pipeline.py file.
First, we need to load the `Config` and fill in the paths of the previously downloaded items.
`Config` manages all paths and hyperparameters in the experiment. In FlashRAG, parameters can be passed to `Config` via a YAML file or a Python dictionary; parameters passed in this way override the internal defaults. For detailed parameter information and default values, refer to our `basic_config.yaml`.
Here, we directly pass the paths via a dictionary.
```python
from flashrag.config import Config

config_dict = {
    'data_dir': 'dataset/',
    'index_path': 'indexes/e5_Flat.index',
    'corpus_path': 'indexes/general_knowledge.jsonl',
    # Replace the placeholders below with your local model paths.
    'model2path': {'e5': '<retriever_path>', 'llama2-7B-chat': '<generator_path>'},
    'generator_model': 'llama2-7B-chat',
    'retrieval_method': 'e5',
    'metrics': ['em', 'f1', 'acc'],
    'retrieval_topk': 1,
    'save_intermediate_data': True
}

config = Config(config_dict=config_dict)
```
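Equivalently, per the YAML option mentioned above, you can keep settings in a file and override individual values with a dictionary. The `config_file_path` keyword name is assumed from FlashRAG's `Config` interface; the precedence (dictionary over file over built-in defaults) follows the description above:

```python
# Hypothetical my_config.yaml holding the same keys as config_dict above;
# values passed via config_dict take precedence over the file.
config = Config(config_file_path='my_config.yaml', config_dict={'retrieval_topk': 5})
```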
Next, we need to load the dataset and pipeline.
The dataset can be automatically read through the previously set config; we only need to select the corresponding test set.
Loading the pipeline requires selecting an appropriate pipeline class for the desired RAG process. Here, we choose `SequentialPipeline` for the Standard RAG process described earlier.
The pipeline will automatically load the corresponding components (retriever and generator) and complete various initializations.
```python
from flashrag.utils import get_dataset
from flashrag.pipeline import SequentialPipeline

all_split = get_dataset(config)
test_data = all_split['test']

pipeline = SequentialPipeline(config)
```
After completing the above steps, we only need to call the pipeline's `.run` method to run the RAG process on the dataset and produce evaluation results. This method returns a dataset containing both intermediate and final results; its `pred` attribute holds the model's predictions.
Note that because we provided toy document collections and indices, the results might be relatively poor. Consider using your own document collections and indices for better results.
After the process completes, all results will be saved in a folder corresponding to the current experiment, including the retrieval and generation results for each query, overall evaluation scores, and more.
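If you want to eyeball individual predictions alongside their questions, something like the following works; the `question` attribute is assumed to exist on the returned dataset analogously to the `pred` attribute mentioned above:

```python
# Pair each query with the model's prediction (attribute names partly assumed).
for question, prediction in zip(output_dataset.question, output_dataset.pred):
    print(f"Q: {question}\nA: {prediction}\n")
```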
The complete code is as follows:
```python
from flashrag.config import Config
from flashrag.utils import get_dataset
from flashrag.pipeline import SequentialPipeline

config_dict = {
    'data_dir': 'dataset/',
    'index_path': 'indexes/e5_Flat.index',
    'corpus_path': 'indexes/general_knowledge.jsonl',
    # Replace the placeholders below with your local model paths.
    'model2path': {'e5': '<retriever_path>', 'llama2-7B-chat': '<generator_path>'},
    'generator_model': 'llama2-7B-chat',
    'retrieval_method': 'e5',
    'metrics': ['em', 'f1', 'acc'],
    'retrieval_topk': 1,
    'save_intermediate_data': True
}

config = Config(config_dict=config_dict)

all_split = get_dataset(config)
test_data = all_split['test']
pipeline = SequentialPipeline(config)

output_dataset = pipeline.run(test_data, do_eval=True)
print("---generation output---")
print(output_dataset.pred)
```