Not extracting tabular data #819
-
How can I extract only the tabular data from a PDF in which columns are separated by tabs and some pages contain multiple tables? I have tried extract_table and extract_tables, but I get an empty list or None in return. I have also tried extract_text; however, it returns everything (text plus tabular data).
-
Hi @88arvin, and thanks for your interest in this library. Unfortunately, without seeing a specific PDF, it's difficult to provide guidance. Can you attach the PDF you're working with?
-
Hi @88arvin, I appreciate your interest in the library. Thanks for sharing the PDF. To extract the table, you should use the explicit_vertical_lines table strategy. For the PDF you shared, a setting might be:
{
"vertical_strategy": "explicit",
"horizontal_strategy": "text",
"explicit_vertical_lines": [60, 120, 300, 400, 500, 560, 660, 750]
}
Feel free to tweak as you see fit. You can find more options here.
With the above strategy, you get a single table containing all the data, so you will have to apply some post-processing to remove the unwanted rows. For example, a regex-based approach could keep only those rows whose first column is a date. That leaves just the rows that contain a transaction.
Furthermore, a word of caution: the PDF you shared hasn't been properly redacted. The data that you masked with the black highlighter is still very much visible. One can simply select the text, then copy and paste it. Since it is a bank statement, my suggestion would be to properly redact and hide any sensitive information. This advice may be ignored provided the necessary consent has been obtained from the user.
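The extraction and the date-based row filter described in this reply could be sketched roughly as follows. This is only an illustration, not a tested solution for the shared PDF: it assumes the library in question is pdfplumber (the thread never names it, but the setting keys match its table settings), the helper names keep_transaction_rows and extract_transactions are made up for this example, and the date pattern is a guess that should be adapted to the statement's actual date format.

```python
import re

# Settings quoted from the reply; the x-coordinates are specific to that PDF.
TABLE_SETTINGS = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "text",
    "explicit_vertical_lines": [60, 120, 300, 400, 500, 560, 660, 750],
}

# Assumption: dates look like 01/02/2023 or 01-02-23. Adjust as needed.
DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$")


def keep_transaction_rows(rows):
    """Keep only the rows whose first cell looks like a date."""
    kept = []
    for row in rows:
        # Cells can be None or empty, and a row can be an empty list.
        first = (row[0] or "").strip() if row else ""
        if DATE_RE.match(first):
            kept.append(row)
    return kept


def extract_transactions(pdf_path):
    """Extract one table per page and keep only the date-led rows."""
    # Imported here so the row filter can be used without the library.
    import pdfplumber

    transactions = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table(TABLE_SETTINGS)
            if table:  # extract_table returns None when no table is found
                transactions.extend(keep_transaction_rows(table))
    return transactions
```

Calling extract_transactions("statement.pdf") (a placeholder file name) would then return the filtered rows from every page. If the column boundaries differ between pages, the explicit_vertical_lines list would need to be set per page instead of being shared.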
-
Thank you so much. :)