Word processing without formatting detection #1612

gerardnorton · 2023-02-24T12:14:53Z

Use Case:

I am getting numerous errors while indexing and have a couple of questions.
This is my configuration file:

name: "demo"
fs:
  url: "/data"
  update_rate: "30s"
  ignore_above: "100mb"
  lang_detect: false
  continue_on_error: true
  checksum: "MD5"
  follow_symlinks: false
  attributes_support: false
  index_folders: true
  index_content: true
  add_filesize: true
  store_source: false

  xml_support: false
  json_support: false
  ocr:
    enabled: false
elasticsearch:
  nodes:
    - url: "http://elasticsearch01:9200"

When I index the content of different file types, I get warnings such as the following:

11:54:59,774 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx.vdf]: exception parsing the csv -> IOException reading next record: java.io.IOException: (line 881) invalid char between encapsulated token and delimiter -> (line 881) invalid char between encapsulated token and delimiter

11:54:59,028 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx.docx]: Error creating OOXML extractor -> No valid entries or contents found, this is not a valid OOXML (Office Open XML) file -> Unexpected record signature: 0x65735504

11:53:34,508 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx.docx.iswr]: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@67511533 -> null

Questions:
Is it possible to activate only the text indexer without any additional Tika processing?
Could you provide an example of tikaConfig.xml for this?
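
To illustrate what I have in mind, here is a rough, untested sketch of such a file. It assumes that listing only `org.apache.tika.parser.txt.TXTParser` under `<parsers>`, without the `DefaultParser` entry, keeps Tika from loading the format-specific parsers (CSV, OOXML, package) that appear in the warnings above. I am not sure whether files detected as other types would still be routed to the text parser, or how this file should be wired into FSCrawler, so please treat it as a starting point rather than a working answer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical tikaConfig.xml: untested sketch, not a confirmed FSCrawler setup -->
<properties>
  <parsers>
    <!-- Assumption: only the plain-text parser is registered. DefaultParser is
         deliberately left out so that no other parsers are discovered. -->
    <parser class="org.apache.tika.parser.txt.TXTParser"/>
  </parsers>
</properties>
```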
