Avoid newline character "\n" when indexing a PDF file #1577

pandveera · 2023-01-03T09:02:37Z

When I create an Elasticsearch index by reading a PDF file, the newline character "\n" ends up in the indexed content. Is there a way to avoid this? Also, is there a way to avoid special characters in the document?

Answered by dadoonet

dadoonet · 2023-01-03T09:57:06Z

Maintainer

The only way I can think of today is to use an Elasticsearch ingest pipeline, which can transform the extracted text.

We could consider adding such a transformation to FSCrawler itself by implementing something similar to https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#filter-content

Instead of ignoring the document, we could replace the matched text using a regular expression...
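
For anyone landing here, this is a rough sketch of what the ingest pipeline approach could look like, not something confirmed in this thread. It builds a pipeline body that uses the Elasticsearch `gsub` processor to collapse line breaks into a single space, and applies the same regular expression locally so you can preview the result. The pipeline id `fscrawler_clean_newlines`, the pattern, and the helper names are assumptions made up for the example, and it assumes the extracted text is in the `content` field.

```python
import json
import re

# Hypothetical names chosen for this sketch; they are not from the thread.
PIPELINE_ID = "fscrawler_clean_newlines"
# Any run of line breaks, plus surrounding whitespace, becomes one space.
NEWLINE_PATTERN = r"\s*[\r\n]+\s*"


def build_pipeline(field: str = "content") -> dict:
    """Return an ingest pipeline body that collapses line breaks into one space.

    Assumes the extracted text lives in `field` ("content" by default).
    """
    return {
        "description": "Replace newlines in extracted text with a single space",
        "processors": [
            {
                "gsub": {
                    "field": field,
                    "pattern": NEWLINE_PATTERN,
                    "replacement": " ",
                    "ignore_missing": True,
                }
            }
        ],
    }


def preview(text: str) -> str:
    """Apply the same regular expression locally to preview the result."""
    return re.sub(NEWLINE_PATTERN, " ", text)


if __name__ == "__main__":
    # JSON body to send with: PUT _ingest/pipeline/<PIPELINE_ID>
    print(json.dumps(build_pipeline(), indent=2))
    print(preview("First line\nSecond line\r\n\r\nThird line"))
```

Once the pipeline exists in the cluster, the job would still need to be configured to send documents through it; check the FSCrawler Elasticsearch settings for the pipeline option. A second `gsub` processor with a different pattern could handle the special characters in the same way.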

pandveera · 2023-01-06T05:31:01Z

Author

Thanks for the reply.
