identify rectangle content (it's not a word, image, curve) #641
pauljohn32
started this conversation in
Ask for help with specific PDFs
Replies: 1 comment
-
Hi @pauljohn32, and thanks for sharing this interesting example. Although these checkboxes are not Acrobat form elements, they are another kind of interactive element: an "annotation." pdfplumber can access these. In the linked and attached notebook (discussion-641.ipynb.txt), you can see what the data for those annotations look like, and how they change when you've checked the checkboxes (as I have in a file that is also referenced in the notebook).
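A minimal sketch of reading those annotations, assuming the accessor meant here is pdfplumber's `page.annots` property. The helper names are hypothetical, and which keys appear under `data` depends on the PDF; `T`, `AS`, and `V` are the PDF keys for field name, appearance state, and field value, so comparing them between an unchecked and a checked copy of the form should show what changes.

```python
def summarize_annot(annot):
    """Reduce one annotation dict to its box and checkbox-related keys."""
    data = annot.get("data") or {}
    return {
        "bbox": (annot.get("x0"), annot.get("top"), annot.get("x1"), annot.get("bottom")),
        "subtype": repr(data.get("Subtype")),
        "field_name": data.get("T"),
        # repr() keeps the raw PDF objects readable without assuming their type
        "appearance_state": repr(data.get("AS")),
        "value": repr(data.get("V")),
    }


def list_annotations(pdf_path, page_index=0):
    """Summarize every annotation on one page of a PDF."""
    import pdfplumber  # imported here so summarize_annot has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return [summarize_annot(annot) for annot in page.annots]
```

Calling `list_annotations("f1040.pdf")` on both versions of the form and comparing the `appearance_state` and `value` entries is one way to see which boxes are checked.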
-
I'm practicing with pdfplumber to detect fields in US government forms. Today's example is a tax form. These have lots of checkboxes.
Typically, the PDFs are generated by tax prep software. So far in my experience, the checkboxes are not Acrobat form elements. I attach an example IRS form for testing.
f1040.pdf
.extract_text and .extract_words work as expected for me, except for these checkbox things. These are not extracted as text or words. They can be found in the rects for the page. So I know there's something in there, but what is it?
The visualization tool works: using the code below, I show highlighted checkboxes with the ImageMagick tool in pdfplumber.
With the attached f1040.pdf file, I run these to make the picture:
If I draw the extract_words output, the checkboxes are not selected. That's why I dig through the rects to find the coordinates that align. I hoped that the checkbox itself and the X mark inside it (if there is one) would be some Unicode characters.
I appreciate your advice.
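The step described above, digging through the rects and highlighting the ones that line up with checkboxes, might look like the following sketch. It is not the code from the original post: the function names are hypothetical, and the size limits (in PDF points) and squareness tolerance are assumptions to tune per form.

```python
def find_checkbox_rects(rects, min_size=4, max_size=20, tolerance=1.0):
    """Keep rects that are small and roughly square, like a form checkbox."""
    candidates = []
    for rect in rects:
        width, height = rect["width"], rect["height"]
        is_small = min_size <= width <= max_size and min_size <= height <= max_size
        is_square = abs(width - height) <= tolerance
        if is_small and is_square:
            candidates.append(rect)
    return candidates


def draw_checkbox_candidates(pdf_path, out_path, page_index=0):
    """Render one page with candidate checkbox rects outlined, and save it."""
    import pdfplumber  # imported here so the filter above has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        boxes = find_checkbox_rects(page.rects)
        image = page.to_image(resolution=150)
        image.draw_rects(boxes)
        image.save(out_path)
    return boxes
```

For example, `draw_checkbox_candidates("f1040.pdf", "checkboxes.png")` would write an image of the first page with the candidate boxes outlined. Filtering on size only finds the box outlines; it does not say whether a box is checked.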