This repository implements a Retrieval-Augmented Generation (RAG) system for semi-structured data extracted from PDF documents. It combines NLP models with custom preprocessing pipelines to parse, classify, summarize, and retrieve content.
- PDF Parsing: Leverages the `unstructured` library to extract diverse elements such as text, tables, and images.
- Data Processing: Processes extracted elements for optimal formatting and utility.
- Element Classification: Classifies elements to aid in further processing and retrieval tasks.
- Content Summarization: Utilizes advanced NLP models for summarizing extracted content.
- Content Retrieval: Employs a multi-vector retrieval system for efficient and relevant content fetching based on user queries.
- Storage Management: Manages storage and retrieval of processed and raw data efficiently.
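The multi-vector idea in the list above can be illustrated with a minimal sketch: a short summary of each element is indexed for search, while the raw element is kept in a separate store and returned on a match. Everything named here (`Element`, `MultiVectorStore`, the bag-of-words similarity) is a hypothetical stand-in chosen so the example runs on the standard library alone. It is not this repository's actual code, which presumably uses learned embeddings and model-generated summaries.

```python
import math
import re
import uuid
from collections import Counter
from dataclasses import dataclass


@dataclass
class Element:
    """A parsed PDF element (hypothetical shape, not the repository's class)."""
    kind: str     # e.g. "text" or "table"
    content: str  # raw text or table markup


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MultiVectorStore:
    """Searches over summaries, but returns the raw elements they describe."""

    def __init__(self):
        self.index = []     # list of (summary vector, doc_id)
        self.docstore = {}  # doc_id -> raw Element

    def add(self, element, summary):
        """Store the raw element and index its summary under a shared id."""
        doc_id = str(uuid.uuid4())
        self.docstore[doc_id] = element
        self.index.append((Counter(tokenize(summary)), doc_id))
        return doc_id

    def search(self, query, k=1):
        """Rank summaries against the query and return the top-k raw elements."""
        q = Counter(tokenize(query))
        ranked = sorted(self.index, key=lambda e: cosine(q, e[0]), reverse=True)
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


# Usage: summaries are written by hand here; a real pipeline would generate them.
store = MultiVectorStore()
table = Element("table", "<table><tr><td>Q1</td><td>4.2</td></tr></table>")
text = Element("text", "The company was founded in 1999 in Oslo.")
store.add(table, "Quarterly revenue figures by quarter")
store.add(text, "Background on when and where the company was founded")

top = store.search("quarterly revenue", k=1)[0]
print(top.kind)
```

Separating the searchable summary from the stored element is what lets a table, whose raw markup matches queries poorly, still be retrieved in full through its natural-language description.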
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Python 3.8 or higher
- pip (Python package installer)
- Clone the repository:
git clone https://github.com/yourusername/yourprojectname.git
- Navigate to the project directory:
cd yourprojectname
- Install the required dependencies:
pip install -r requirements.txt
- Run the main script:
python src/main.py
For a detailed usage guide and documentation of the architecture and functionality, see the docs/ directory in this project.