Word processing without formatting detection #1612
gerardnorton started this conversation in General
Use Case:
I am getting numerous errors while indexing and have a few questions.
This is my configuration file:
When I try to index the content of different types of files, I have some exceptions like the following:
11:54:59,774 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx.vdf]: exception parsing the csv -> IOException reading next record: java.io.IOException: (line 881) invalid char between encapsulated token and delimiter -> (line 881) invalid char between encapsulated token and delimiter
11:54:59,028 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx.docx]: Error creating OOXML extractor -> No valid entries or contents found, this is not a valid OOXML (Office Open XML) file -> Unexpected record signature: 0x65735504
11:53:34,508 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx.docx.iswr]: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@67511533 -> null
Questions:
Is it possible to enable only plain-text indexing, without any additional Tika processing?
Could you provide an example of tikaConfig.xml for this?