The PDF Similarity Matcher is a command-line tool for finding and displaying PDF documents similar to a given input PDF based on extracted text features. It leverages text extraction and similarity comparison to help you identify relevant matches from a directory of PDFs.
- Extracts text from PDF files.
- Processes and compares features from multiple PDFs.
- Calculates similarity scores between an input PDF and PDFs in the directory.
- Optionally displays detailed key-value feature information for similar PDFs.
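This README does not say how similarity scores are computed. As a rough illustration of the idea only, the sketch below ranks already-extracted text by cosine similarity over plain word counts, using just the standard library. The function names (tokenize, cosine_similarity, top_matches) are hypothetical and are not part of pdfsim; the real tool may use different features or weighting.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty document as similar to nothing
    return dot / (norm_a * norm_b)


def top_matches(input_text, candidates, top_n=1):
    """Rank candidates (a dict of name -> extracted text) against input_text.

    Returns up to top_n (name, score) pairs, highest score first.
    """
    scored = [
        (name, cosine_similarity(input_text, text))
        for name, text in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Here the candidates dictionary stands in for the text extracted from each PDF in the directory, and top_n plays the role of the -t option.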
Follow these steps to install and set up the PDF Similarity Matcher:
pip install pdfsim
To find similar PDFs, use the following command:
pdfsim -d <directory_containing_pdfs> -i <input_pdf> [-t <top_n>] [-kv]
- -d, --database (required): Path to the directory containing PDF files to compare against.
- -i, --input (required): Path to the input PDF file you want to compare.
- -t, --top (optional, default: 1): Number of top similar PDFs to display.
- -kv (optional): Enable detailed key-value feature output for similar PDFs.
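To make the option semantics concrete, here is a minimal argparse sketch that mirrors the flags listed above. It is an assumption about how such a parser could look, not pdfsim's actual source; the build_parser name, the help strings, and the integer type for -t are illustrative.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented pdfsim options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pdfsim")
    parser.add_argument(
        "-d", "--database", required=True,
        help="directory containing the PDF files to compare against",
    )
    parser.add_argument(
        "-i", "--input", required=True,
        help="input PDF file to compare",
    )
    parser.add_argument(
        "-t", "--top", type=int, default=1,  # int type is an assumption
        help="number of top similar PDFs to display",
    )
    parser.add_argument(
        "-kv", dest="kv", action="store_true",
        help="show detailed key-value feature output for similar PDFs",
    )
    return parser


if __name__ == "__main__":
    # Example arguments; the paths are placeholders.
    args = build_parser().parse_args(["-d", "pdfs", "-i", "query.pdf", "-t", "3", "-kv"])
    print(args.database, args.input, args.top, args.kv)
```

With this layout, omitting -t falls back to 1 and omitting -kv leaves the detailed output disabled, matching the defaults described above.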
Follow these steps to set up the project locally:
- Clone the repository:
  git clone https://github.com/yourusername/pdfsim.git
  cd pdfsim
- Create a virtual environment:
  python3 -m venv venv
- Activate the virtual environment:
  - On Windows:
    venv\Scripts\activate
  - On macOS/Linux:
    source venv/bin/activate
- Install the required packages:
  pip install -r requirements.txt
Ensure requirements.txt includes the necessary libraries: PyPDF2, scikit-learn, nltk.
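After installing, a quick way to confirm the three libraries are available in the active virtual environment is to check that their import names resolve. The helper below is a hypothetical convenience script, not something shipped with pdfsim. Note that scikit-learn is imported as sklearn.

```python
from importlib.util import find_spec

# Maps each package name from requirements.txt to its top-level import name.
REQUIRED_MODULES = {
    "PyPDF2": "PyPDF2",
    "scikit-learn": "sklearn",
    "nltk": "nltk",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the package names whose import module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported missing, rerun pip install -r requirements.txt with the virtual environment activated.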