Store all cosmos extractions with document #49

brandomr · 2023-09-12T18:07:34Z

Overview

#42 implements a 1:1 swap of SKEMA's PDF extraction for Cosmos's and includes polling for Cosmos results. However, @mattprintz's PR only fetches the JSON that Cosmos extracts, not the other useful elements:

- extracted tables
- extracted images
- extracted figures

As seen in UWisc's demo notebook, these can also be fetched for 24 hours after the PDF is sent over the wire. One outstanding issue is where to store this information: these are pieces of the document, so they shouldn't be put into the document's file_names and should instead go somewhere else. See the corresponding issue on TDS.


Sorrento110 · 2023-09-12T18:56:04Z

TODOS:

Run UWisc's notebook to generate other Cosmos extractions.
Explore the best potential changes to TDS to store these different extractions.
Set up a meeting (Brandon, Yohann, Powell, others) to discuss the options for storing extractions in TDS. Related: data-service#326 (Add assets to documents).
Make the decided changes to TDS in order to store the extractions.
- update documents model with new assets field
- create uploader mechanism for assets
- create downloader mechanism for assets
Building on the code implemented in #42 (Switch to Cosmos extractor from SKEMA), implement new code to store the other important extractions in TDS for each document.
- add fetch calls for all assets generated by Cosmos, grabbing each of their JSON blobs
- write the assets zip file to a temporary location
- parse through the assets zip file, extracting all .parquet and .png files
- collect all asset metadata from the JSON blobs
- ~~parse through asset parquet files to get asset contents, as they are not present in the JSON object (Tables, Equations)~~ (Cosmos is adding this to the JSON metadata)
- ~~merge parquet metadata with JSON metadata~~ (Cosmos is adding this to the JSON metadata)
- patch document assets in TDS with asset metadata
- get a presigned URL to upload each of the raw assets as .png files
- upload each document asset to S3
- remove the temporary zip file and all other extracted files
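
The zip-handling steps above (extracting the .parquet and .png files, then cleaning up) could be sketched roughly as follows. This is only an illustration using the Python standard library, not the code from #52: the function name `extract_assets`, the `handle_asset` callback, and the `ASSET_SUFFIXES` set are hypothetical, and the fetch, TDS patch, and S3 upload steps are deliberately left to the callback because their endpoints are not described in this issue.

```python
import tempfile
import zipfile
from pathlib import Path

# Assumption: only these two file types in the assets zip are of interest.
ASSET_SUFFIXES = {".parquet", ".png"}


def extract_assets(zip_path, handle_asset):
    """Extract .parquet and .png members of zip_path and hand each to handle_asset.

    handle_asset is a hypothetical callback (for example, one that uploads the
    file via a presigned URL). Returns the names of the files processed.
    """
    processed = []
    # TemporaryDirectory deletes every extracted file when the block exits,
    # even if handle_asset raises.
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.infolist():
                if member.is_dir():
                    continue
                if Path(member.filename).suffix.lower() not in ASSET_SUFFIXES:
                    continue
                extracted = Path(archive.extract(member, path=tmp_dir))
                handle_asset(extracted)
                processed.append(extracted.name)
    return processed
```

The function does not delete the zip file itself; a caller that wrote it to a temporary location could remove it afterwards with `Path(zip_path).unlink()`.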

brandomr · 2023-09-14T16:13:31Z

@Sorrento110 I see the task for making changes to TDS as checked off but no update on DARPA-ASKEM/data-service#326 or a PR. Can you get a PR up for that if it's done?

brandomr assigned Sorrento110 Sep 12, 2023

Sorrento110 linked a pull request Sep 14, 2023 that will close this issue

Added code to cosmos_extraction to download all document assets... #52

Merged

Sorrento110 mentioned this issue Sep 14, 2023

Added code to cosmos_extraction to download all document assets... #52

Merged

brandomr closed this as completed in #52 Sep 26, 2023
