PDF Data Extractor

PDF Data Extractor extracts any kind of required data from image-based (scanned) PDFs.

Current purpose - extracting phone numbers from PDFs.

SETUP

After installing all the dependencies (listed in requirements.txt), activate the virtual environment with
- source env/bin/activate on Mac
- env\Scripts\activate on Windows
After activation, enter the following in the command line (on Windows Command Prompt, use set instead of export):
export FLASK_APP=app and export FLASK_ENV=development
Now start the server with
flask run
The server runs at localhost:5000.

Once the server is running on localhost:5000, open it in a browser, upload the PDF, and submit.

Processing takes about 1 minute per MB of PDF on average.
Alternatively, send an HTTP POST request to /phonenumbers with a form-data field named 'file' and attach the PDF to it.
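
The POST request to /phonenumbers could be sent from a script roughly as follows. This is a minimal sketch using only the Python standard library; the helper names build_multipart and post_pdf are hypothetical, the URL assumes the default local server, and the response is returned as raw bytes because its format is not documented here.

```python
import urllib.request
import uuid


def build_multipart(field_name, filename, data, content_type="application/pdf"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_pdf(pdf_path, url="http://localhost:5000/phonenumbers"):
    """Upload a PDF in the 'file' form field and return the raw response body."""
    with open(pdf_path, "rb") as f:
        body, header = build_multipart("file", "upload.pdf", f.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the server running, post_pdf("pdfs/sample.pdf") would upload that file; the path here is only a placeholder.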

The uploaded PDF is rendered page by page to PNG images using the PyMuPDF library.
These images are then passed to pytesseract, which uses Tesseract OCR under the hood to recognize text in the images.
The extracted text is then passed through a function that picks out the phone numbers.
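
The three steps above can be sketched as follows. This is an illustration, not the project's actual code: the function names are hypothetical, the pattern assumes a phone number is a standalone run of exactly 10 digits, and the dpi argument of get_pixmap assumes a reasonably recent PyMuPDF release.

```python
import io
import re

# Assumption: a "phone number" is a standalone run of exactly 10 digits.
PHONE_RE = re.compile(r"(?<!\d)\d{10}(?!\d)")


def find_phone_numbers(text):
    """Return the unique 10-digit numbers in text, in order of first appearance."""
    return list(dict.fromkeys(PHONE_RE.findall(text)))


def pdf_to_text(pdf_path, dpi=300):
    """Render each page to a PNG with PyMuPDF and OCR it with pytesseract."""
    # Imported here so the text-filtering step works without the OCR stack.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            png_bytes = page.get_pixmap(dpi=dpi).tobytes("png")
            image = Image.open(io.BytesIO(png_bytes))
            pages.append(pytesseract.image_to_string(image))
    return "\n".join(pages)


def extract_phone_numbers(pdf_path):
    """Run the full pipeline: PDF -> page images -> OCR text -> phone numbers."""
    return find_phone_numbers(pdf_to_text(pdf_path))
```

Keeping the text filter separate from the OCR step means the pattern can be adjusted and tested on plain strings without processing a PDF each time.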

Other filter functions can be swapped in to extract other kinds of data from the PDF text.
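
As an example of extracting a different kind of data, a filter for email addresses might look like the sketch below. The function name find_emails is hypothetical, and the pattern is deliberately simple rather than a full address validator.

```python
import re

# Assumption: a simple pattern that covers common addresses, not every valid one.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the unique email addresses found in text, sorted alphabetically."""
    return sorted(set(EMAIL_RE.findall(text)))
```

Such a function would take the same OCR text that the phone number filter receives.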
