Avoid newline character "\n" when indexing a PDF file #1577

Answered by dadoonet
pandveera asked this question in Q&A
The only way I can think of today is to use an Elasticsearch ingest pipeline that transforms the extracted text.
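
As a rough illustration of the ingest pipeline idea, here is a minimal sketch that builds a pipeline body using the Elasticsearch `gsub` and `trim` processors. It assumes the extracted text is stored in a field named `content`; the pipeline id `fscrawler_strip_newlines` and the helper `build_pipeline` are made-up names for this example, not something FSCrawler ships.

```python
import json

# Assumption: the extracted text lives in a field named "content".
# The pipeline id below is a hypothetical name chosen for this example.
PIPELINE_ID = "fscrawler_strip_newlines"


def build_pipeline(field="content"):
    """Return an ingest pipeline body that turns runs of newlines into one space."""
    return {
        "description": "Replace newline characters in extracted text with a space",
        "processors": [
            # gsub applies a regular expression replacement to a string field.
            {
                "gsub": {
                    "field": field,
                    "pattern": r"[\r\n]+",
                    "replacement": " ",
                    "ignore_missing": True,
                }
            },
            # trim removes leading and trailing whitespace left behind.
            {"trim": {"field": field, "ignore_missing": True}},
        ],
    }


if __name__ == "__main__":
    # Prints the JSON body to send with: PUT _ingest/pipeline/<PIPELINE_ID>
    print(json.dumps(build_pipeline(), indent=2))
```

Once the pipeline exists in the cluster, the FSCrawler job would need to be configured to index through it; check the FSCrawler documentation for the exact setting name in your version.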

We could think of such a transformation in FSCrawler itself by implementing something similar to https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#filter-content

Instead of ignoring the document, we could replace the matching text using a regular expression...
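
To make the "replace instead of ignore" idea concrete, here is a small sketch of what such a transformation could do to the extracted text. This is purely illustrative: FSCrawler has no such option today, and `RULES` and `apply_replacements` are hypothetical names.

```python
import re

# Hypothetical replacement rules as (pattern, replacement) pairs.
# FSCrawler does not currently offer a setting like this.
RULES = [
    (r"[\r\n]+", " "),  # collapse runs of newlines into one space
    (r" {2,}", " "),    # collapse repeated spaces
]


def apply_replacements(content, rules=RULES):
    """Apply each regex replacement in order, then trim the result."""
    for pattern, replacement in rules:
        content = re.sub(pattern, replacement, content)
    return content.strip()


if __name__ == "__main__":
    print(apply_replacements("Hello\nworld\r\n\r\nfoo  bar\n"))
```

Applying the rules in order keeps the behavior predictable when one replacement produces text that a later rule should clean up.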

Answer selected by dadoonet
Category
Q&A
Labels
feature_request for feature request
This discussion was converted from issue #1576 on January 06, 2023 07:19.