Store all cosmos extractions with document #49

Closed
brandomr opened this issue Sep 12, 2023 · 2 comments · Fixed by #52

@brandomr
Contributor

Overview

#42 swaps SKEMA's PDF extraction 1:1 for Cosmos's and adds polling for Cosmos results. However, @mattprintz's PR fetches only the JSON that Cosmos extracts, not the other useful elements:

  • extracted tables
  • extracted images
  • extracted figures

As seen in UWisc's demo notebook, these can also be fetched for 24 hours after the PDF is sent over the wire. One outstanding issue is where to store this information: these are pieces of the document, so they shouldn't be put into the document's file_names and should go somewhere else instead. See the corresponding issue on TDS.

@Sorrento110
Contributor

Sorrento110 commented Sep 12, 2023

TODOs:

  • Run UWisc's notebook to generate other Cosmos extractions.
  • Explore the best potential changes to TDS to store these different extractions.
  • Set up a meeting (Brandon, Yohann, Powell, others) to discuss the options for storing extractions in TDS. Related: DARPA-ASKEM/data-service#326 (Add assets to documents).
  • Make the decided changes to TDS in order to store the extractions.
    • update documents model with new assets field
    • create uploader mechanism for assets
    • create downloader mechanism for assets
  • Working off the code implemented in #42 (Switch to Cosmos extractor from SKEMA), implement new code to store the other important extractions in TDS for each document.
    • add fetch calls for all assets generated by Cosmos, grabbing each of their JSON blobs
    • write assets zip file to temporary location
    • parse the assets zip file, extracting all .parquet and .png files
    • collect all asset metadata from the JSON blobs
    • parse the asset parquet files to get asset contents (tables, equations), as they are not present in the JSON object (Cosmos is adding this to the JSON metadata)
    • merge parquet metadata with JSON metadata (Cosmos is adding this to the JSON metadata)
    • patch document assets in TDS with asset metadata
    • get presigned URL to upload each of the raw assets as .png files.
    • upload each document asset to S3.
    • remove the temporary zip file and all other extracted files.
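
For reference, the local file-handling steps in the list above (write the zip to a temporary location, pull out the .parquet and .png files, then clean up) could be sketched roughly as follows. This is only a sketch using the Python standard library: `ExtractedAsset` and `extract_assets` are hypothetical names, not code from #42 or TDS, and the sketch assumes the zip bytes have already been fetched from Cosmos. The fetch, the TDS patch, and the S3 upload are left out on purpose, since their endpoints aren't described in this thread.

```python
import tempfile
import zipfile
from dataclasses import dataclass
from pathlib import Path

# File types named in the task list; anything else in the zip is ignored.
ASSET_SUFFIXES = {".parquet", ".png"}


@dataclass
class ExtractedAsset:
    """Hypothetical record for one file pulled out of the assets zip."""
    file_name: str
    suffix: str
    content: bytes


def extract_assets(zip_bytes: bytes) -> list[ExtractedAsset]:
    """Write the zip to a temp location, collect parquet/png files, clean up."""
    assets: list[ExtractedAsset] = []
    # TemporaryDirectory deletes the zip and all extracted files on exit,
    # which covers the final "remove the temporary zip file" step.
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        zip_path = tmp_dir / "assets.zip"
        zip_path.write_bytes(zip_bytes)

        extract_dir = tmp_dir / "extracted"
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(extract_dir)

        # Walk the extracted tree and keep only the asset file types.
        for path in sorted(extract_dir.rglob("*")):
            if path.is_file() and path.suffix.lower() in ASSET_SUFFIXES:
                assets.append(
                    ExtractedAsset(path.name, path.suffix.lower(), path.read_bytes())
                )
    return assets
```

Reading the file contents into memory before the temporary directory is removed keeps cleanup automatic. For large extractions it may be better to upload each file from disk inside the `with` block instead.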

@brandomr
Contributor Author

@Sorrento110 I see the task for making changes to TDS is checked off, but there's no update on DARPA-ASKEM/data-service#326 and no PR. Can you get a PR up for that if it's done?
