Commit

add enrich document paragraph
ilmcconnell authored Dec 7, 2020
1 parent e07bcb3 commit 614accf
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions docsrc/source/ingest.rst
@@ -61,6 +61,15 @@ Word Embeddings
We provide the option to train word embeddings on the extracted corpus. We use FastText_ to train them at
ingestion time, and the resulting embeddings are saved to disk.

Context Enrichment
------------------

If the enrich option is enabled at ingest time, all output Parquet files from the ingest process are enhanced with
semantic context. For every table or table caption row in those files, every mention of that row's table label is
detected within the full text content of the same document. The detected context is then appended to the output
Parquet files as a duplicate of the original table or table caption row, with the original content replaced by all of
the available context text. As a result, searches over context-enriched data that match the relevant context should
also return the tables themselves.
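
The enrichment step described above can be illustrated with a minimal sketch. This is not the project's
implementation: the row keys (``doc``, ``cls``, ``content``, ``enriched``), the class names, the ``enrich`` function,
and the rule that a label looks like "Table 3" are all assumptions made for illustration, and rows are plain
dictionaries rather than Parquet records. The sketch also assumes that context is collected only from non-table rows.

```python
import re

# Hypothetical class names for rows that should be enriched (assumption).
TABLE_CLASSES = {"Table", "Table Caption"}
# Assumption: a table label looks like "Table 3".
LABEL_RE = re.compile(r"\bTable\s+\d+\b", re.IGNORECASE)


def enrich(rows):
    """Return the original rows plus one context duplicate per labelled table row."""
    enriched = list(rows)
    for row in rows:
        if row["cls"] not in TABLE_CLASSES:
            continue
        match = LABEL_RE.search(row["content"])
        if match is None:
            continue  # no label found, so there is nothing to look up
        label = match.group(0)
        # The (?!\d) lookahead keeps "Table 1" from matching inside "Table 12".
        mention = re.compile(r"\b" + re.escape(label) + r"(?!\d)", re.IGNORECASE)
        # Collect every non-table row of the same document that mentions the label.
        context = [
            other["content"]
            for other in rows
            if other["doc"] == row["doc"]
            and other["cls"] not in TABLE_CLASSES
            and mention.search(other["content"])
        ]
        if context:
            # Duplicate the row, replacing its content with the joined context text.
            enriched.append({**row, "content": " ".join(context), "enriched": True})
    return enriched
```

Calling ``enrich`` on a list of row dictionaries leaves the original rows in place and appends the duplicates at the
end, so a search that matches the context text can be traced back to the table or caption row it was copied from.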


.. _preprint: https://arxiv.org/abs/1910.12462
.. _PDFMiner.six: https://github.com/pdfminer/pdfminer.six