
A CLI tool to find similar PDFs in a given directory


KrishavRajSingh/pdfsim


PDF Similarity Matcher

The PDF Similarity Matcher is a command-line tool that finds and displays the PDF documents most similar to a given input PDF. It extracts text from each PDF, compares the extracted features, and reports the closest matches from a directory of PDFs.

Features

  • Extracts text from PDF files.
  • Processes and compares features from multiple PDFs.
  • Calculates similarity scores between an input PDF and PDFs in the directory.
  • Optionally displays detailed key-value feature information for similar PDFs.
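
The README does not describe how similarity scores are computed. As a rough illustration only, the sketch below ranks already-extracted text with TF-IDF weighting and cosine similarity, using nothing but the standard library. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_similar`) are hypothetical and are not part of pdfsim's API, and the tool's real feature extraction may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF stays positive even for terms present in every document.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(input_text, corpus, top_n=1):
    """Return the top_n (name, score) pairs from corpus, most similar first.

    corpus maps a file name to text that has already been extracted from it.
    """
    names = list(corpus)
    vectors = tfidf_vectors([input_text] + [corpus[name] for name in names])
    query, candidates = vectors[0], vectors[1:]
    scored = [(name, cosine(query, vec)) for name, vec in zip(names, candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up sample texts standing in for extracted PDF content.
corpus = {
    "invoice_a.pdf": "invoice total amount due payment",
    "recipe.pdf": "flour sugar butter oven bake",
}
print(rank_similar("invoice payment due amount", corpus, top_n=1))
```

In this sketch a score of 0 means the two texts share no terms, and scores closer to 1 indicate more heavily overlapping vocabulary.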

Installation

Install the PDF Similarity Matcher with pip:

    pip install pdfsim

Usage

To find similar PDFs, use the following command:

pdfsim -d <directory_containing_pdf> -i <input_pdf> [-t <top_n>] [-kv]

Arguments

  • -d, --database (required): Path to the directory containing PDF files to compare against.
  • -i, --input (required): Path to the input PDF file you want to compare.
  • -t, --top (optional, default: 1): Number of top similar PDFs to display.
  • -kv (optional): Enable detailed key-value feature output for similar PDFs.
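
The arguments listed above could be declared with Python's standard `argparse` module roughly as follows. This is a minimal sketch that mirrors only the documented flags; `build_parser` is a hypothetical name, the help strings are paraphrased from this README, and pdfsim's actual parser may be set up differently.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags documented in this README."""
    parser = argparse.ArgumentParser(
        prog="pdfsim",
        description="Find PDFs similar to an input PDF.",
    )
    parser.add_argument(
        "-d", "--database", required=True,
        help="Path to the directory containing PDF files to compare against.",
    )
    parser.add_argument(
        "-i", "--input", required=True,
        help="Path to the input PDF file you want to compare.",
    )
    parser.add_argument(
        "-t", "--top", type=int, default=1,
        help="Number of top similar PDFs to display (default: 1).",
    )
    # A single-dash, multi-letter flag; argparse stores it as args.kv.
    parser.add_argument(
        "-kv", action="store_true",
        help="Enable detailed key-value feature output for similar PDFs.",
    )
    return parser


# Parse a sample command line instead of sys.argv so the sketch runs as-is.
args = build_parser().parse_args(["-d", "./pdfs", "-i", "query.pdf", "-t", "3", "-kv"])
print(args.database, args.input, args.top, args.kv)
```

Omitting `-t` leaves `args.top` at its default of 1, and omitting `-kv` leaves `args.kv` as `False`.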

Contributing

Follow these steps to set up the project locally:

  1. Clone the repository:

    git clone https://github.com/yourusername/pdfsim.git
    cd pdfsim
  2. Create a virtual environment:

    python3 -m venv venv
  3. Activate the virtual environment:

    • On Windows:

      venv\Scripts\activate
    • On macOS/Linux:

      source venv/bin/activate
  4. Install the required packages:

    pip install -r requirements.txt

    Ensure requirements.txt includes the necessary libraries:

    PyPDF2
    scikit-learn
    nltk
    
