Not extracting tabular data #819

Answered by samkit-jain
88arvin asked this question in Q&A
Feb 20, 2023 · 3 comments · 8 replies
Hi @88arvin, I appreciate your interest in the library, and thanks for sharing the PDF. To extract the table, set the vertical strategy to "explicit" and supply the column boundaries through explicit_vertical_lines. For the PDF you shared, the settings might be:

{
    "vertical_strategy": "explicit",
    "horizontal_strategy": "text",
    "explicit_vertical_lines": [60, 120, 300, 400, 500, 560, 660, 750]
}

Feel free to tweak these values as you see fit. You can find more options here. With the above settings, you get a single table containing all the data, so you will need some post-processing to remove the unwanted rows. For example, a regex-based approach could keep only those rows whose first column is a date. It will give you all …
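
A minimal sketch of how the settings and the regex-based filtering described above might fit together. The settings dict is quoted from the answer; `pdfplumber.open` and `page.extract_table` are pdfplumber calls. Everything else is an assumption made for illustration: the helper names `keep_date_rows` and `extract_rows` are hypothetical, and the date pattern (for example 20/02/2023 or 20-02-2023) is a guess that would need adjusting to the actual PDF.

```python
import re

# Table settings quoted from the answer above.
TABLE_SETTINGS = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "text",
    "explicit_vertical_lines": [60, 120, 300, 400, 500, 560, 660, 750],
}

# Assumption: dates look like 20/02/2023 or 20-02-2023. Adjust to the PDF.
DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$")


def keep_date_rows(rows):
    """Keep only rows whose first cell looks like a date (hypothetical helper)."""
    kept = []
    for row in rows:
        # Cells can be None or padded with whitespace, so normalise first.
        first = (row[0] or "").strip() if row else ""
        if DATE_RE.match(first):
            kept.append(row)
    return kept


def extract_rows(pdf_path):
    """Extract one table per page with the settings above, then filter it."""
    import pdfplumber  # imported lazily so the filter can be used on its own

    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table(TABLE_SETTINGS)
            if table:  # extract_table returns None when no table is found
                rows.extend(table)
    return keep_date_rows(rows)
```

Calling `extract_rows` with the path to the PDF would return only the rows that start with a date. Keeping the filter separate from the extraction makes it easy to swap in a different pattern if the first column uses another date format.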

Answer selected by samkit-jain
Category: Q&A
Labels: awaiting-code-or-pdf (Issues and PRs awaiting code and/or a PDF from issue/PR-author)
4 participants