The original project is https://github.com/timClicks/slate , which does not support Python 3. Thanks to the original author, @timClicks, and the other contributors.
Slate is a Python package that simplifies the process of extracting text from PDF files. It depends on the PDFMiner package.
Slate provides one class, PDF. PDF takes a file-like object and extracts all text from the document, presenting each page as a string of text:
    >>> with open('example.pdf', 'rb') as f:
    ...     doc = slate.PDF(f)
    ...
    >>> doc
    [..., ..., ...]
    >>> doc[1]
    'Text from page 2...'
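Because each page is presented as a string and the document is indexed like a list, ordinary sequence code can be applied to the result. The following is a minimal sketch under that assumption. The helper names `pages_containing` and `join_pages` are hypothetical and not part of slate, and a plain list of strings stands in for a `slate.PDF` object so the example runs without a PDF file.

```python
from typing import List, Sequence


def pages_containing(pages: Sequence[str], term: str) -> List[int]:
    """Return 1-based page numbers whose text contains `term` (case-insensitive)."""
    needle = term.lower()
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in text.lower()
    ]


def join_pages(pages: Sequence[str], separator: str = "\n\n") -> str:
    """Concatenate all page strings into one document-wide string."""
    return separator.join(pages)


# Stand-in for a slate.PDF object: a plain list of page strings.
sample_pages = [
    "Introduction to slate",
    "Text from page 2...",
    "Appendix: PDFMiner notes",
]

print(pages_containing(sample_pages, "pdfminer"))  # [3]
print(join_pages(sample_pages, separator=" | "))
```

With a real document, `doc` from the example above would take the place of `sample_pages`, assuming it supports iteration in the same way it supports indexing.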
If your PDF is password-protected, pass the password as the second argument:
    >>> with open('secrets.pdf', 'rb') as f:
    ...     doc = slate.PDF(f, 'password')
    ...
    >>> doc[0]
    "My mother doesn't know this, but..."
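The two call forms shown above can be wrapped in one small function. This is only a sketch: `extract_pages` is a hypothetical helper, not something slate provides, and it assumes that a `slate.PDF` object can be converted to a list of page strings with `list()`.

```python
from typing import List, Optional


def extract_pages(path: str, password: Optional[str] = None) -> List[str]:
    """Open the PDF at `path` and return its pages as a list of strings."""
    # Imported inside the function so the helper can be defined even in an
    # environment where slate is not installed.
    import slate

    with open(path, "rb") as f:
        # Pass the password only when one was given, mirroring the two
        # call forms shown in the examples above.
        if password is None:
            doc = slate.PDF(f)
        else:
            doc = slate.PDF(f, password)
        # Assumption: the PDF object is iterable, yielding one string per page.
        return list(doc)
```

A call such as `extract_pages('secrets.pdf', 'password')` would then return the page strings directly, and the file is closed as soon as extraction finishes.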
If you would like access to the images, font files and other information, then take some time to learn the PDFMiner API.
Using PDFMiner directly has some drawbacks:
- Getting simple things done, like extracting the text, is quite complex. PDFMiner is not designed to return Python objects, which makes interfacing with it irritating.
- It's an extremely complete set of tools, with multiple, moderately steep learning curves.
- It's not written with hackability in mind.