extract_words() behaving poorly while extract_text() works fine with the same parameters. #503

Answered by jsvine
PrimoJefe asked this question in Q&A
Hello, and thanks for your interest in this library. Apologies for not responding sooner. Are you able to provide the PDF and the specific page number? It's a bit difficult to debug without them. In the meantime, a couple of observations that may or may not be helpful:

  • You're passing a y_tolerance in your first example, but not in your second, so the two calls are not actually using the same parameters.

  • Passing extra_attrs changes how pdfplumber groups characters into words. In your example, if two characters do not share the same size and fontname, they will not be grouped into the same word.
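
To illustrate the second point, here is a minimal sketch in plain Python. It is a toy model, not pdfplumber's actual implementation: the helper `group_chars` and the sample character dicts are hypothetical, and the model ignores positions and tolerances entirely. It only shows the general idea that a word is split wherever one of the attributes listed in extra_attrs changes between neighboring characters.

```python
from itertools import groupby


def group_chars(chars, extra_attrs=()):
    """Toy model (hypothetical helper, not part of pdfplumber).

    Treats `chars` as one run of adjacent characters and starts a new
    word whenever any attribute named in `extra_attrs` changes.
    """
    def key(char):
        return tuple(char[attr] for attr in extra_attrs)

    return ["".join(c["text"] for c in run) for _, run in groupby(chars, key=key)]


# Hypothetical characters: a bold initial letter followed by regular text.
chars = [
    {"text": "H", "fontname": "Helvetica-Bold", "size": 12.0},
    {"text": "e", "fontname": "Helvetica", "size": 12.0},
    {"text": "l", "fontname": "Helvetica", "size": 12.0},
    {"text": "l", "fontname": "Helvetica", "size": 12.0},
    {"text": "o", "fontname": "Helvetica", "size": 12.0},
]

print(group_chars(chars))                        # ['Hello']
print(group_chars(chars, ["fontname", "size"]))  # ['H', 'ello']
```

If the words come out fragmented in a similar way, it may be worth calling extract_words() once without extra_attrs, and with the same y_tolerance you pass to extract_text(), to see whether the results then line up.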

Answer selected by PrimoJefe