-
When I create an Elasticsearch index from a PDF file, the newline character "\n" ends up in the indexed content. Is there a way to avoid this? Also, is there a way to remove special characters from the document?
Replies: 2 comments
-
The only way I can think of today is to use an Elasticsearch ingest pipeline, which can transform the extracted text. We could consider adding such a transformation to FSCrawler itself by implementing something similar to https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#filter-content. Instead of ignoring the document, we could replace the matching text using a regular expression...
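To make the ingest pipeline idea concrete, here is a minimal sketch in Python that builds a pipeline definition using Elasticsearch's `gsub` and `trim` processors and previews the effect locally. It assumes the extracted text lives in the `content` field. The pipeline name `fscrawler_clean_text` and the character whitelist are illustrative assumptions, not something FSCrawler ships, so adjust the patterns to whatever counts as a "special character" in your documents.

```python
import json
import re

# Hypothetical pipeline name, chosen for illustration only.
PIPELINE_NAME = "fscrawler_clean_text"
# Assumption: the extracted text is stored in the `content` field.
FIELD = "content"

# (pattern, replacement) pairs, applied in order.
# 1. Drop every character outside an assumed ASCII whitelist
#    (whitespace is kept here so that step 2 can normalize it).
# 2. Collapse each run of whitespace, including "\n", into one space.
SUBSTITUTIONS = [
    (r"[^A-Za-z0-9\s.,;:!?'-]", ""),
    (r"\s+", " "),
]


def build_pipeline():
    """Return the JSON body for PUT _ingest/pipeline/<PIPELINE_NAME>."""
    processors = [
        {
            "gsub": {
                "field": FIELD,
                "pattern": pattern,
                "replacement": replacement,
                "ignore_missing": True,
            }
        }
        for pattern, replacement in SUBSTITUTIONS
    ]
    # Remove the leading/trailing space left over after collapsing.
    processors.append({"trim": {"field": FIELD, "ignore_missing": True}})
    return {
        "description": "Strip newlines and special characters from extracted text",
        "processors": processors,
    }


def preview(text):
    """Apply the same substitutions locally to check the patterns."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text.strip()


if __name__ == "__main__":
    print(json.dumps(build_pipeline(), indent=2))
    print(preview("Hello\nworld § 2024!\n"))
```

The printed JSON can be sent to Elasticsearch with `PUT _ingest/pipeline/fscrawler_clean_text`. The job then has to reference that pipeline; as far as I know this is done through the `elasticsearch.pipeline` job setting, but check the FSCrawler documentation for your version. Note that `preview` uses Python's `re`, whereas Elasticsearch evaluates the patterns with Java regular expressions. The two patterns above avoid constructs that differ between the two engines.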
-
Thanks for the reply.