PDF data extractor extracts required data of any kind from image-based PDFs.
Current purpose: it is used to extract phone numbers from PDFs.
- Install the latest Python release (3.9.5 at the time of writing) from the Python download page.
- Add Python to your system path if on Windows.
- Install pip
- Install virtualenv with
pip install virtualenv
- Create a virtual env in the project root with
virtualenv env
- Install all dependencies with
pip install -r requirements.txt
- Install Tesseract on your system
- On Windows, use a Tesseract build for Windows
- On Mac, run
brew install tesseract
- After installing all the dependencies, activate the virtual environment with
source env/bin/activate
on Mac, or
env\Scripts\activate
on Windows
- After activation, in the command line enter
export FLASK_APP=app
and
export FLASK_ENV=development
(in the Windows command prompt, use set in place of export)
- Now run with
flask run
- Server runs at
localhost:5000
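The setup steps above depend on a few tools being reachable from the command line. As a rough sketch (not part of the project), the following script checks for them before you run the server. The function name `check_prerequisites` and the minimum Python version are assumptions chosen for illustration.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 9)):
    """Return a dict mapping each prerequisite name to True or False."""
    return {
        # The README mentions Python 3.9.5; min_python is an assumed lower bound.
        "python": sys.version_info >= min_python,
        # pytesseract calls the tesseract binary, so it must be on the PATH.
        "tesseract": shutil.which("tesseract") is not None,
        # "flask run" needs the flask command, normally present once the
        # virtual environment is activated and dependencies are installed.
        "flask": shutil.which("flask") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Run it inside the activated virtual environment; anything reported as MISSING points to the setup step that still needs attention.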
- Once the server is running on localhost:5000, open it in a browser, upload the PDF and submit. Average processing time is about 1 min per MB of PDF.
- Alternatively, send an HTTP POST request to /phonenumbers with a form-data field named 'file' and attach the PDF to it.
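The POST request can be sketched with the Python standard library alone. This is an illustration under assumptions: the helper names `build_multipart` and `post_pdf` are hypothetical, and the response format is not documented here, so the raw response text is returned as-is.

```python
import os
import urllib.request
import uuid


def build_multipart(pdf_bytes, filename="document.pdf", field="file"):
    """Encode one PDF as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def post_pdf(path, url="http://localhost:5000/phonenumbers"):
    """POST the PDF at `path` in the 'file' field and return the response text."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(fh.read(), os.path.basename(path))
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # OCR is slow (roughly 1 min per MB), so the timeout is generous.
    with urllib.request.urlopen(request, timeout=600) as response:
        return response.read().decode("utf-8")
```

With the server running, `print(post_pdf("scan.pdf"))` would upload a local file named `scan.pdf` (a placeholder name) and print whatever the endpoint returns.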
- The given PDF is rendered page by page to PNG images using the PyMuPDF library.
- These images are then passed to pytesseract, which uses Tesseract OCR under the hood to recognize text in images.
- The extracted text is then passed through a function that filters out phone numbers.
Other filter functions can be used in its place to extract other kinds of data from the PDF.
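The three steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the zoom factor, and especially the phone-number pattern are assumptions, and the real filter may accept different formats. The PyMuPDF and pytesseract calls use method names from recent releases of those libraries.

```python
import io
import re

# Assumed pattern: an optional +country code, then 10 digits with optional
# spaces, dots, or dashes between the groups. The project's filter may differ.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+\d{1,3}[\s.-]?)?\d{3}[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def find_phone_numbers(text):
    """Return every substring of `text` that matches the assumed pattern."""
    return PHONE_RE.findall(text)


def ocr_pdf(path, zoom=2.0):
    """Render each page to PNG with PyMuPDF and OCR it with pytesseract."""
    # Imported lazily so the regex filter works without the OCR stack installed.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    doc = fitz.open(path)
    try:
        for page in doc:
            # A zoom above 1 renders at higher resolution, which tends to help OCR.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            pages.append(pytesseract.image_to_string(image))
    finally:
        doc.close()
    return "\n".join(pages)


def extract_phone_numbers(path):
    """Full pipeline: PDF pages -> PNG images -> OCR text -> phone numbers."""
    return find_phone_numbers(ocr_pdf(path))
```

Extracting a different kind of data would mean swapping `find_phone_numbers` for another filter over the same OCR text, for example a pattern for email addresses.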