OlehOleinikov / pdf_to_pandas_table_converter Public template


Converting PDF table-like data to pandas dataframe


Files

.gitignore
README.md
WorkingTable.py
demo.gif
main.py
range_generator.py


Converting large table-like PDF data to pandas

Input files

Large PDF files (2,000-4,000 pages per file). All files share a similar structure; they are a legacy of office tables exported to PDF.

Goals

get a file of table-like data that can be used with pandas
prevent "out-of-memory" errors
save the results to a file in a size-insensitive format
track progress and the estimated time remaining
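The progress and time-estimate goal can be illustrated with a small tracker. This is a minimal sketch, not the repository's actual code: the class name `ProgressTracker`, its methods, and the injectable `clock` parameter are assumptions made for illustration. It estimates the remaining time by assuming the average speed so far stays constant.

```python
import time


class ProgressTracker:
    """Hypothetical helper: tracks converted pages and estimates remaining time."""

    def __init__(self, total_pages, clock=time.monotonic):
        self.total_pages = total_pages
        self.done_pages = 0
        self._clock = clock          # injectable so the tracker can be tested
        self._start = clock()

    def update(self, pages):
        """Record that `pages` more pages have been converted."""
        self.done_pages += pages

    def percent_done(self):
        """Return progress as a percentage of all pages."""
        if self.total_pages == 0:
            return 100.0
        return 100.0 * self.done_pages / self.total_pages

    def eta_seconds(self):
        """Return estimated seconds remaining, or None before any progress."""
        if self.done_pages == 0:
            return None
        elapsed = self._clock() - self._start
        remaining = self.total_pages - self.done_pages
        # Linear extrapolation: time per page so far times pages left.
        return elapsed * remaining / self.done_pages
```

Calling `update()` after each converted part and printing `percent_done()` and `eta_seconds()` gives a running estimate. The estimate is rough if page complexity varies a lot within a file.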

Solution

Count the number of pages in each file
Split the page range into parts for part-by-part conversion (converting one page at a time is slow)
Convert each part and append it to a common dataframe
Save the complete dataframe to a pickle file
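The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `main.py` or `range_generator.py`: the function names `split_page_ranges` and `convert_pdf`, the default `chunk_size`, and the output path are hypothetical. The page count is passed in as `total_pages`; it could be obtained with a PDF library of your choice. The sketch uses `tabula.read_pdf` from tabula-py (which needs a Java runtime) and `pandas.concat`.

```python
from typing import Iterator, Tuple


def split_page_ranges(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first, last) page ranges covering the file."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def convert_pdf(path: str, total_pages: int, chunk_size: int = 50,
                out_path: str = "result.pkl"):
    """Convert a PDF part by part and save the combined dataframe as a pickle."""
    # Imported here so split_page_ranges can be used without these packages.
    import pandas as pd
    import tabula  # tabula-py

    parts = []
    for first, last in split_page_ranges(total_pages, chunk_size):
        # With multiple_tables=True, read_pdf returns a list of DataFrames
        # found on the requested pages.
        tables = tabula.read_pdf(path, pages=f"{first}-{last}", multiple_tables=True)
        parts.extend(tables)

    # Assumes all parts share the same columns, as the input files
    # are described as having a similar structure.
    result = pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()
    result.to_pickle(out_path)
    return result
```

For example, `list(split_page_ranges(10, 4))` gives `[(1, 4), (5, 8), (9, 10)]`. The chunk size is a trade-off: larger parts reduce per-call overhead, while smaller parts keep peak memory lower and allow finer progress reporting.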

Topics: pdf, pandas, text-parser, text-parsing, tabula-py
