Not extracting tabular data #819

88arvin · 2023-02-20T14:33:40Z

88arvin
Feb 20, 2023

How can I extract only the tabular data from a PDF in which the columns are separated by tabs and some pages contain multiple tables?

I have tried extract_table and extract_tables, but I get an empty list or None in return. I have also tried extract_text; however, it returns everything (text plus tabular data).

Answered by samkit-jain


jsvine · 2023-02-20T14:46:36Z

jsvine
Feb 20, 2023
Maintainer

Hi @88arvin, and thanks for your interest in this library. Unfortunately, without seeing a specific PDF, it's difficult to provide guidance. Can you attach the PDF you're working with?

1 reply

88arvin Feb 21, 2023
Author

sample.pdf

samkit-jain · 2023-02-21T08:49:39Z

samkit-jain
Feb 21, 2023
Collaborator

Hi @88arvin, I appreciate your interest in the library, and thanks for sharing the PDF. To extract the table, you should use the "explicit" vertical strategy together with the explicit_vertical_lines setting. For the PDF you shared, a possible setting is

{
    "vertical_strategy": "explicit",
    "horizontal_strategy": "text",
    "explicit_vertical_lines": [60, 120, 300, 400, 500, 560, 660, 750]
}

Feel free to tweak it as you see fit. You can find more options here. With the above settings, you will get only one table containing all the data, so you will have to apply some post-processing to remove the unwanted rows. For example, a regex-based approach could keep only those rows whose first column is a date. That leaves you with all the rows that contain a transaction.
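As a rough sketch of how the settings and the regex filter could fit together: the helper names (`keep_dated_rows`, `extract_transactions`) and the date pattern are assumptions made for illustration, not part of pdfplumber, and the pattern in particular needs adjusting to the date format your statement actually uses.

```python
import re

# Table settings suggested above for the shared PDF.
TABLE_SETTINGS = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "text",
    "explicit_vertical_lines": [60, 120, 300, 400, 500, 560, 660, 750],
}

# Assumed date format (e.g. 20-02-2023 or 20/02/2023); adjust to your statement.
DATE_RE = re.compile(r"^\d{2}[-/]\d{2}[-/]\d{4}$")


def keep_dated_rows(rows):
    """Keep only the rows whose first cell looks like a date."""
    return [
        row for row in rows
        if row and row[0] and DATE_RE.match(row[0].strip())
    ]


def extract_transactions(pdf_path):
    """Extract one table per page with the explicit settings, then filter."""
    import pdfplumber  # third-party; imported here so the filter works on its own

    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table(table_settings=TABLE_SETTINGS)
            if table:  # extract_table returns None when nothing is found
                rows.extend(table)
    return keep_dated_rows(rows)
```

Calling `extract_transactions("sample.pdf")` would then return only the dated rows; the column positions are the ones suggested for the shared PDF and will likely differ for other layouts.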

Furthermore, a word of caution: the PDF you shared hasn't been properly redacted. The data that you masked with the black highlighter is still very much visible; one can simply select the text, then copy and paste it. Since it is a bank statement, my suggestion would be to redact it properly and hide any sensitive information. This advice may be ignored provided the necessary consent has been obtained from the user.

6 replies

samkit-jain Feb 24, 2023
Collaborator

Scanned statements won't work. You'll have to run OCR on the statement to create a PDF with copyable text, and then run the extraction on that.

chanpreet90 Feb 25, 2023

It is a copyable PDF. Only the first (date) column is not being extracted.

samkit-jain Feb 26, 2023
Collaborator

Hi @chanpreet90, please share the PDF so that it is easier to look into the issue. It might also be better to create a new discussion for it.

chanpreet90 Mar 8, 2023

20010222002119 (1).pdf

samkit-jain Mar 9, 2023
Collaborator

The whole first page is an image, so this is indeed a scanned statement. Automatic detection of line separators for table extraction won't work on this PDF.

88arvin · 2023-02-22T10:40:02Z

88arvin
Feb 22, 2023
Author

Thank you so much. :)
I tried your suggested approach, and it is working well, but it extracts only the rows that have dates.
The issue now is that this method excludes certain details from the particulars column, because they are spread across several lines that don't have dates.

1 reply

samkit-jain Feb 23, 2023
Collaborator

In that case, you can write your own custom logic. For example, if the rows look like

date	particulars	chq	withdrawal	deposit	balance
date	particular 1	chq	withdrawal	deposit	balance
	particular 2
date	particular 3	chq	withdrawal	deposit	balance
date	particular 4	chq	withdrawal	deposit	balance

The logic could then be: merge rows i through i+n when rows i and i+n+1 are both valid transaction rows and all the rows from i+1 to i+n have only the particulars column filled.
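One way this merge could be sketched is below. It is a simplification of the rule above: each row that has only the particulars cell filled is folded into the row before it, without separately checking that the following row is a valid transaction. The function names and the default column index (1, matching the layout in the example) are assumptions for illustration.

```python
def is_continuation(row, particulars_idx=1):
    """True when the particulars cell is the only non-empty cell in the row."""
    if len(row) <= particulars_idx:
        return False
    if not (row[particulars_idx] or "").strip():
        return False
    return all(
        not (cell or "").strip()
        for i, cell in enumerate(row)
        if i != particulars_idx
    )


def merge_continuation_rows(rows, particulars_idx=1):
    """Fold continuation rows into the particulars of the preceding row."""
    merged = []
    for row in rows:
        if merged and is_continuation(row, particulars_idx):
            previous = (merged[-1][particulars_idx] or "").strip()
            extra = row[particulars_idx].strip()
            merged[-1][particulars_idx] = f"{previous} {extra}".strip()
        else:
            merged.append(list(row))  # copy so the input rows stay untouched
    return merged
```

Run this on the unfiltered table first and apply the date filter afterwards; otherwise the continuation rows are dropped before they can be merged.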
