Identify PDFs that went through a previous OCR pass #844

luanmota · 2023-03-22T04:00:59Z

luanmota
Mar 22, 2023

I've come across some PDFs that were digitized and converted into "readable" PDFs (you can select characters, copy them, etc.). But in all these cases the metadata seems to be messed up, because extract_text returns a lot of special characters. Is there any way to identify these files?

jsvine · 2023-03-22T15:10:49Z

jsvine
Mar 22, 2023
Maintainer

Unfortunately, I'm not aware of a foolproof way of identifying such PDFs. But you might be able to devise a heuristic that works for you, based on some combination of pdf.metadata and the presence of full-page images. Haven't tried this myself, though.
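
For anyone who wants to try this, here is a rough sketch of what such a heuristic might look like. It is untested against real scans and rests on several assumptions: the keyword list `OCR_HINTS`, the 90% coverage threshold, the 50% page ratio, and all three function names are made up for illustration and will need tuning for your own files. The pdfplumber pieces it relies on are `pdfplumber.open`, `pdf.metadata`, `pdf.pages`, `page.width`, `page.height`, and `page.images`.

```python
# Illustrative, non-exhaustive markers (an assumption); adjust to your corpus.
OCR_HINTS = ("ocr", "tesseract", "abbyy", "finereader", "paper capture")


def metadata_suggests_ocr(metadata, hints=OCR_HINTS):
    """Return True if the Producer or Creator field mentions an OCR-related hint."""
    metadata = metadata or {}
    fields = (metadata.get("Producer", ""), metadata.get("Creator", ""))
    text = " ".join(str(field) for field in fields).lower()
    return any(hint in text for hint in hints)


def has_full_page_image(page_width, page_height, images, min_coverage=0.9):
    """Return True if any single image covers at least min_coverage of the page area."""
    page_area = page_width * page_height
    if page_area <= 0:
        return False
    return any(
        (img["width"] * img["height"]) / page_area >= min_coverage
        for img in images
    )


def looks_previously_ocred(path, min_page_ratio=0.5):
    """Guess whether a PDF is a scan with an OCR text layer. Not foolproof."""
    import pdfplumber  # imported lazily so the helpers above have no dependencies

    with pdfplumber.open(path) as pdf:
        # Signal 1: the producing software names itself as an OCR tool.
        if metadata_suggests_ocr(pdf.metadata):
            return True
        # Signal 2: most pages are dominated by one large image (the scan).
        pages = pdf.pages
        if not pages:
            return False
        scanned = sum(
            has_full_page_image(page.width, page.height, page.images)
            for page in pages
        )
        return scanned / len(pages) >= min_page_ratio
```

Calling `looks_previously_ocred("some_file.pdf")` (a placeholder path) returns a boolean guess. Keeping the two checks as separate functions makes it easier to see which signal fires on your files and to reweight them.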

1 reply

luanmota Apr 1, 2023
Author

The idea to use pdf.metadata helped a lot! Thanks again, @jsvine.
