Skip to content

Latest commit

 

History

History
52 lines (37 loc) · 1.41 KB

README.md

File metadata and controls

52 lines (37 loc) · 1.41 KB

PDFscraper

PDFscraper uses PDFMiner and Python Tesseract to text mine pdfs.

Requirements

PDFscraper requires python 3.x

The following python packages are prerequisites:

  • pdfminer.six
  • pytesseract
  • chardet
  • Python Imaging Library (PIL) or Pillow
  • pdf2image

Other requirements: Install of Google Tesseract OCR and Poppler

Usage

usage: pdfscraper.py [-h] -i INPDF -o OUTTXT [-t]

optional arguments:
  -h, --help            show this help message and exit
  -i INPDF, --input-dir INPDF
                        Path to the input pdf files
  -o OUTTXT, --output-dir OUTTXT
                        Path for the output txt files
  -t, --token-gen       Use flag to generate tokenized output

E.g. To run

python pdfscraper.py -i /path/to/input/pdfs -o /path/to/output/directory

PDFscraper also has an optional flag -t, which produces tokenized text for use in Natural Language Processing (NLP) tasks. E.g. to produce tokenized output:

python pdfscraper.py -i /path/to/input/pdfs -o /path/to/output/directory -t

Docker

Alternatively, the accompanying Dockerfile can be used to run the program in a docker container.

E.g. To run

docker run -v "/path/to/input/pdfs:/data" --rm pdfscraper:latest -i /data -o /data