I've come across some PDFs that were digitized and converted into "readable" PDFs (you can select characters, copy them, etc.). But in all of these cases the metadata seems to be messed up, because text_extract returns a lot of special characters. Is there any way to identify these files?
Answered by
jsvine
Mar 22, 2023
Answer selected by
luanmota
Unfortunately, I'm not aware of a foolproof way of identifying such PDFs. But you might be able to devise a heuristic that works for you, based on some combination of `pdf.metadata` and the presence of full-page images. Haven't tried this myself, though.
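
One way such a heuristic might look is sketched below. It assumes the library in question is pdfplumber, and that each entry in `page.images` is a dict with `x0`, `x1`, `top` and `bottom` keys. The helper names, the `OCR_HINTS` keywords and the 0.9 coverage threshold are all hypothetical choices for illustration and untested against real scans, so treat them as starting points to tune on your own files.

```python
import string

# Illustrative guesses at OCR/scanner names; not an authoritative list.
OCR_HINTS = ("ocr", "scan", "tesseract", "abbyy")


def metadata_hints_ocr(metadata):
    """True if any metadata value mentions one of the OCR_HINTS keywords."""
    blob = " ".join(str(value) for value in metadata.values()).lower()
    return any(hint in blob for hint in OCR_HINTS)


def has_full_page_image(images, page_width, page_height, min_coverage=0.9):
    """True if a single image covers at least min_coverage of the page area."""
    page_area = page_width * page_height
    if page_area <= 0:
        return False
    for img in images:
        # Clip the image box to the page before measuring its area.
        width = max(0, min(img["x1"], page_width) - max(img["x0"], 0))
        height = max(0, min(img["bottom"], page_height) - max(img["top"], 0))
        if (width * height) / page_area >= min_coverage:
            return True
    return False


def special_char_ratio(text):
    """Fraction of characters that are not alphanumeric, whitespace or ASCII punctuation."""
    if not text:
        return 0.0
    odd = sum(
        1
        for ch in text
        if not (ch.isalnum() or ch.isspace() or ch in string.punctuation)
    )
    return odd / len(text)


def scan_signals(path):
    """Collect the three signals for one file; requires pdfplumber to be installed."""
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        pages = pdf.pages
        full_pages = sum(
            has_full_page_image(page.images, page.width, page.height)
            for page in pages
        )
        text = "".join(page.extract_text() or "" for page in pages)
        return {
            "metadata_hints_ocr": metadata_hints_ocr(pdf.metadata or {}),
            "full_page_image_share": full_pages / len(pages) if pages else 0.0,
            "special_char_ratio": special_char_ratio(text),
        }
```

Calling `scan_signals("some_file.pdf")` (a placeholder path) returns the three signals as a dict, so you can decide for yourself how to combine them, for example by flagging files where most pages carry a full-page image and the special-character ratio is unusually high.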