Merge pull request #139 from UW-COSMOS/ILM_enrich_docs
add enrich document paragraph
iross authored Dec 7, 2020
2 parents 7c61a71 + 614accf commit 4aa562e
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions docsrc/source/ingest.rst
@@ -61,6 +61,15 @@ Word Embeddings
We provide the option to train word embeddings on the extracted corpus. Training uses _FastText and runs at ingestion
time. The resulting embeddings are saved to disk.

Context Enrichment
------------------

If the enrich option is enabled at ingest time, every output parquet file from the ingest process is enhanced with
semantic context. For each table or table caption row in those files, every mention of that row's table label is
detected within the full text content of the same document. The collected context is then appended to the output
parquet files as a duplicate of the original table or table caption row, with the original content replaced by all of
the available context text. As a result, a search over context-enriched data that matches this context text should
return the table itself.
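
The enrichment step described above can be illustrated with a minimal sketch. It is not the project's implementation:
the row layout (``pdf_name``, ``cls``, ``content``), the class names, and the ``Table <number>`` label pattern are all
assumptions made for illustration, and plain dictionaries stand in for parquet rows. The sketch also simplifies by
searching only non-table rows for mentions.

```python
import re

# Assumption: table labels look like "Table 3". The real label format may differ.
LABEL_RE = re.compile(r"\bTable\s+(\d+)", re.IGNORECASE)
# Assumption: these class names mark table and table caption rows.
TABLE_CLASSES = ("Table", "Table Caption")


def enrich(rows):
    """Return the input rows plus one context row per labelled table row."""
    # Group rows by document so mentions are only searched within the same document.
    by_doc = {}
    for row in rows:
        by_doc.setdefault(row["pdf_name"], []).append(row)

    enriched = list(rows)
    for row in rows:
        if row["cls"] not in TABLE_CLASSES:
            continue
        match = LABEL_RE.search(row["content"])
        if match is None:
            continue  # no detectable label, so nothing to look up
        # Match this exact label only, so "Table 1" does not match "Table 12".
        label = re.compile(r"\bTable\s+%s\b" % match.group(1), re.IGNORECASE)
        context = [
            other["content"]
            for other in by_doc[row["pdf_name"]]
            if other["cls"] not in TABLE_CLASSES and label.search(other["content"])
        ]
        if context:
            # Duplicate the original row, replacing its content with the context text.
            enriched.append({**row, "content": " ".join(context)})
    return enriched


sample = [
    {"pdf_name": "a.pdf", "cls": "Table Caption", "content": "Table 1: Porosity by depth."},
    {"pdf_name": "a.pdf", "cls": "Body Text", "content": "Porosity decreases with depth (Table 1)."},
    {"pdf_name": "a.pdf", "cls": "Body Text", "content": "See Table 12 for grain sizes."},
    {"pdf_name": "b.pdf", "cls": "Body Text", "content": "Table 1 lists the samples."},
]
enriched = enrich(sample)
```

In this sketch the appended row keeps the ``cls`` and ``pdf_name`` of the caption row, so a search that hits the context
sentence resolves to the table's own row type, which mirrors the behaviour the paragraph describes.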


.. _preprint: https://arxiv.org/abs/1910.12462
.. _PDFMiner.six: https://github.com/pdfminer/pdfminer.six