Problem in extracting words by Font Style (face) #957

ProxyAyush · 2023-08-07T23:55:47Z


I cannot extract words from a PDF by filtering on the `fontname` property of the char objects. My code is below:

```python
import pdfplumber


def get_filtered_text(file_to_parse: str) -> None:
    with pdfplumber.open(file_to_parse) as pdf:
        # Hard-coded to the first 286 pages of this particular PDF
        for i in range(0, 286):
            page = pdf.pages[i]

            # Characters set at size 33
            clean_text = page.filter(
                lambda obj: obj["object_type"] == "char" and obj["size"] == 33
            )
            # if clean_text.extract_text() != "":
            #     print(clean_text.extract_text())

            # Characters of size 11 or smaller whose fontname is exactly "Arial"
            clean_text2 = page.filter(
                lambda obj: obj["object_type"] == "char"
                and obj["size"] <= 11
                and obj["fontname"] == "Arial"
            )
            if clean_text2.extract_text() != "":
                print(clean_text2.extract_text())


get_filtered_text("/Users/User/Desktop/kundu modified.pdf")
```

jsvine (Maintainer) · 2023-08-08T14:42:30Z

Hi @ProxyAyush, and thanks for your interest in this library. Without the PDF, this will be difficult to diagnose. Can you provide it?

And are you sure that the `fontname` property of the characters you want is exactly the string "Arial"? And, if so, are you sure that there are such characters with a size of 11 or smaller?
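
One way to check both questions is to list the font names and sizes that actually occur in the file. The sketch below is only an illustration: `font_summary`, `is_small_arial`, and `summarize_pdf` are hypothetical helper names, not part of pdfplumber, and the loose substring match assumes the font may be reported with a subset prefix or suffix (for example `ABCDEF+ArialMT`) rather than as plain `"Arial"`. Sizes are floats, so they are rounded before counting.

```python
from collections import Counter


def font_summary(chars):
    """Count characters per (fontname, size rounded to 1 decimal) pair."""
    return Counter((c["fontname"], round(c["size"], 1)) for c in chars)


def is_small_arial(obj, max_size=11.0):
    """Loose filter: char objects whose fontname contains 'arial' (any case)
    and whose size is at most max_size. The substring match is an assumption
    about how the font is named in the PDF."""
    return (
        obj.get("object_type") == "char"
        and "arial" in obj.get("fontname", "").lower()
        and obj.get("size", 0) <= max_size
    )


def summarize_pdf(path):
    """Aggregate font_summary over every page of the PDF at `path`."""
    import pdfplumber  # imported here so the helpers above work without it

    totals = Counter()
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            totals.update(font_summary(page.chars))
    return totals
```

If the summary shows a name such as `ABCDEF+ArialMT`, then `page.filter(is_small_arial).extract_text()` may return the text that the exact comparison `obj["fontname"] == "Arial"` misses.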

Also, if you could, please edit your message to fix the formatting issues, which are currently making the code a bit difficult to read.
