Releases: UW-COSMOS/Cosmos
v0.7.1
v0.7.0
Change base image
The previous base image was deprecated. Switching to nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04 as the base image.
v0.6.1 - Minor table extraction fix
- Fixed a bug where empty parquet files stopped all table extraction processing.
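The kind of guard that avoids this failure mode can be sketched as follows. This is a minimal illustration, not the project's actual fix: `extract_tables_from_files`, `load_rows`, and `extract_tables` are hypothetical names, and the loader is passed in so the sketch does not depend on any particular parquet library.

```python
from typing import Callable, Dict, Iterable, List


def extract_tables_from_files(
    paths: Iterable[str],
    load_rows: Callable[[str], List[dict]],
    extract_tables: Callable[[List[dict]], List[dict]],
) -> Dict[str, List[dict]]:
    """Run table extraction per file, skipping files that contain no rows.

    All names here are hypothetical; the loader and extractor are injected.
    """
    results: Dict[str, List[dict]] = {}
    for path in paths:
        rows = load_rows(path)
        if not rows:
            # An empty file is skipped instead of aborting the whole run.
            continue
        results[path] = extract_tables(rows)
    return results
```

With this shape, one empty input only removes that file from the results; the remaining files are still processed.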
Table extraction, HTCosmos
New:
- Inclusion of table extraction (via the --extract-tables option on the ingest_documents script)
- HTCosmos: run the COSMOS pipeline in a high-throughput mode on an HTCondor cluster
Table context enrichment, text normalization, and fixes
- Table context enrichment during ingestion. Enabling (via the --use-table-context-enrichment option on the ingest CLI) will match detected tables to mentions within the body text, adding a context_from_text field to the output parquet.
- The retrieval API has been updated to search either:
  - local_content field (default): the text content of the table and its associated caption, if any
  - full_content field: local_content plus context_from_text
  - Any of the three fields separately (content, caption_content, context_from_text)
- Text normalization. Enabling (via the --use-text-normalization option on the ingest CLI) will do basic unicode normalization to regularize ligature usage and mojibake issues from the text layer.
- ASKE-ID lookup within the retrieval API.
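To illustrate what the text normalization step is aiming at, here is a minimal sketch using only the Python standard library. It is an assumption about the general technique, not the code COSMOS runs: `repair_mojibake` and `normalize_text` are hypothetical helper names, NFKC is one possible choice of normalization form, and the mojibake repair only covers the common case of UTF-8 text mis-decoded as Latin-1.

```python
import unicodedata


def repair_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was mis-decoded as Latin-1.

    Hypothetical helper: if the round trip fails, the input is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def normalize_text(text: str) -> str:
    """Repair simple mojibake, then apply NFKC normalization.

    NFKC replaces compatibility characters such as the single-character
    ligature U+FB01 with their plain equivalents ("fi").
    """
    return unicodedata.normalize("NFKC", repair_mojibake(text))
```

Note that NFKC also folds other compatibility characters (superscript digits, for instance, become ordinary digits), which may or may not be desirable for scientific text. A dedicated library would handle more mojibake patterns than this round trip does.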
v0.4.0 - New weights; retrieval API updates
- New weights including a newer set of annotations
- Added a few necessary files for training detection + postprocessing.
- API key requirement added (though currently disabled)
- Document level lookups and filters
- Filter by dataset_id
- Store and filter on object size
- Concatenate the contents and header_content fields into one full_contents field and use that for retrieval
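The concatenation and dataset filter described in this list can be sketched roughly as below. The field names contents, header_content, full_contents, and dataset_id come from the notes above; the function names, the newline separator, the field order, and the plain-dict record shape are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional


def build_full_contents(record: Dict[str, Optional[str]]) -> str:
    """Join contents and header_content into a single retrieval string.

    Missing or blank fields are skipped; separator and order are assumptions.
    """
    parts = [record.get("contents"), record.get("header_content")]
    return "\n".join(p.strip() for p in parts if p and p.strip())


def filter_by_dataset(records: Iterable[dict], dataset_id: str) -> List[dict]:
    """Keep only the records whose dataset_id matches the requested one."""
    return [r for r in records if r.get("dataset_id") == dataset_id]
```

A record with no header simply yields its contents, so retrieval can always query the one combined field.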
v0.3.0 - New pipeline, entity linking and semantic context for tables
Optimization and connected components update
- Remove equation2latex
- Move merging to run.py
- Remove extra call to tesseract from list2html
- Add handling of margin objects in connected components
Attentive RCNN and updates to ingestion
Update document segmentation model with Attentive RCNN model.