hardyyb2/ImagePDF_data_extractor

PDF Data Extractor

PDF Data Extractor extracts required data from image-based (scanned) PDFs.

Current purpose: extracting phone numbers from PDFs.

SETUP

Run all steps from the project root:

  • Install the latest Python release (3.9.5 at the time of writing).
    Download Python
  • On Windows, add Python to your system PATH
    Add to Path
  • Install pip
  • Install virtualenv with
    pip install virtualenv
  • Create a virtual environment in the project root with
    virtualenv env
  • Install all dependencies with
    pip install -r requirements.txt
  • Install Tesseract OCR on your system
    • Windows
    • brew install tesseract on Mac

GET STARTED

  • After installing all the dependencies, activate the virtual environment with
    • source env/bin/activate on Mac
    • env\Scripts\activate on Windows
  • After activation, enter the following on the command line
    export FLASK_APP=app and export FLASK_ENV=development
    (in the Windows Command Prompt, use set in place of export)
  • Start the server with
    flask run
  • The server runs at localhost:5000

HOW TO USE

  • Once the server is running, open localhost:5000 in a browser, upload the PDF and submit.

    Average processing time: about 1 minute per MB of PDF.

  • Alternatively, send an HTTP POST request to /phonenumbers with a form-data field named 'file' and attach the PDF to it.
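
The POST request above can be sketched with only the Python standard library. This is a minimal sketch, not part of the project: the helper names build_multipart and post_pdf are hypothetical, the URL assumes the default localhost:5000 server, and the response is returned as raw text because this README does not describe its format.

```python
import uuid
import urllib.request


def build_multipart(field_name, filename, data, content_type="application/pdf"):
    """Encode one file as a multipart/form-data body.

    Returns a tuple of (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_pdf(pdf_path, url="http://localhost:5000/phonenumbers"):
    """Upload the PDF in the 'file' field and return the raw response text."""
    with open(pdf_path, "rb") as fh:
        body, header = build_multipart("file", "upload.pdf", fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    # OCR is slow (about 1 minute per MB), so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=600) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the server running, calling post_pdf("sample.pdf") (a hypothetical file name) uploads the file and returns whatever the endpoint sends back.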

HOW IT WORKS

  • Each page of the given PDF is rendered to a PNG image using the PyMuPDF library.
  • These images are then processed with pytesseract, which uses Tesseract OCR under the hood to recognize text in images.
  • The extracted text is then passed through a function that filters out phone numbers.
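
The three steps above could look roughly like the following. This is a sketch under stated assumptions, not the project's actual code: the function names, the zoom factor and the phone number pattern (10 to 13 digits, optional leading "+", optional spaces or hyphens) are all illustrative choices. It assumes PyMuPDF, pytesseract and Pillow are installed and that the tesseract binary is on the PATH.

```python
import io
import re

# Assumed pattern: 10 to 13 digits, optional leading "+", optional single
# space or hyphen between digits. The project's real filter is not shown here.
PHONE_RE = re.compile(r"(?<![\d+])\+?\d(?:[ -]?\d){9,12}(?!\d)")


def find_phone_numbers(text):
    """Return unique phone-like strings in order of first appearance."""
    seen = []
    for match in PHONE_RE.finditer(text):
        number = re.sub(r"[ -]", "", match.group())  # drop separators
        if number not in seen:
            seen.append(number)
    return seen


def ocr_pdf(pdf_path, zoom=2.0):
    """Render each page to PNG with PyMuPDF and OCR it with pytesseract."""
    # Imported lazily so find_phone_numbers works without the OCR stack.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Upscale before OCR; zoom=2.0 is an assumed, untuned value.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            pages.append(pytesseract.image_to_string(image))
    return "\n".join(pages)


def extract_phone_numbers(pdf_path):
    """Full pipeline: PDF pages -> PNG images -> OCR text -> phone numbers."""
    return find_phone_numbers(ocr_pdf(pdf_path))
```

Keeping the OCR step separate from the filter means the filter can be tested on plain strings without running Tesseract.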

Other filter functions can be substituted to extract other kinds of data from the PDF text.
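
As an illustration of a different filter, the sketch below pulls email addresses out of OCR text. It is a hypothetical example, not a function shipped with the project, and the pattern is deliberately simple rather than a complete email grammar.

```python
import re

# Simplified, assumed pattern: local part, "@", domain, dot, 2+ letter TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the unique email addresses in the text, lowercased and sorted."""
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})
```

Any function that takes the OCR text as a string and returns a list of matches could be plugged in the same way.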
