
Update github action to check if outputs already exist #103

Open
AlexAxthelm opened this issue Jun 19, 2024 · 3 comments
@AlexAxthelm
Contributor

Running index prep is a long process that's part of the current build process for workflow.transition.monitor, and will probably be for workflow.pacta.webapp as well.

Given that the indices don't actually change that much, it would make sense to check whether the process actually needs to run, or whether we can bypass it and return the previous results instead.

My general thinking is to construct a hash based on:

  • all of the pacta-data files,
  • files from this repo,
  • benchmark_inputs files, and
  • the docker image (SHA),

and use that as a versioning key; then we can check whether the appropriate files already exist in the blob store/AFS.
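For concreteness, here's a minimal sketch of one way the "some_magic_function" below could work, hashing every file in the input directories plus the image SHA. The function name, the directory arguments, and the `DOCKER_IMAGE_SHA` variable are all placeholders of mine, not anything this repo defines:

```shell
# Sketch: hash file contents of the given directories plus the docker
# image SHA into one versioning key. Assumes GNU coreutils (sort -z,
# sha256sum), which the GitHub-hosted Linux runners have.
compute_input_hash() {
  # $@ = directories to include; DOCKER_IMAGE_SHA set by an earlier step
  {
    # hash every file, sorted by path so the result is deterministic
    find "$@" -type f -print0 | sort -z | xargs -0 sha256sum
    echo "${DOCKER_IMAGE_SHA}"
  } | sha256sum | cut -d' ' -f1
}

# usage inside the workflow step (paths are illustrative):
# hash=$(compute_input_hash pacta-data benchmark_inputs .)
```

Sorting before the outer hash matters: `find` returns files in filesystem order, which isn't stable across machines.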

so it would look something like:

# Download all the files from AZ
# Pull the Base image

- id: hash
  run: |
    hash=some_magic_function(all those inputs)
    echo "hash=$hash" >> "$GITHUB_OUTPUT"

- id: check
  name: check if file exists on AZ
  run: |
    response=$(az storage blob exists --blob-url "$CONTAINER_URL/${{ steps.hash.outputs.hash }}/foo.rds")
    parsed_response=$(echo "$response" | jq '.some_filter')
    echo "exists=$parsed_response" >> "$GITHUB_OUTPUT"

- if: ${{ steps.check.outputs.exists == 'true' }}
  # early return of the Blob URL

- if: ${{ steps.check.outputs.exists != 'true' }}
  # run the rest of the process normally
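As an aside on the parsing step: if I remember right, `az storage blob exists` prints JSON along the lines of `{"exists": true}`, so the `jq` filter would just pull out that boolean. A quick local sketch, with the CLI response simulated by a literal string so it's easy to try without Azure credentials:

```shell
# Simulated `az storage blob exists` response (format is my assumption):
response='{"exists": true}'

# Extract the boolean with jq; -r gives the bare value without quotes.
exists=$(printf '%s' "$response" | jq -r '.exists')
echo "exists=$exists"
```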

cc @jdhoffa @cjyetman: do I have the hash keys right, or am I missing something?

@AlexAxthelm AlexAxthelm self-assigned this Jun 19, 2024
@jdhoffa
Member

jdhoffa commented Jul 2, 2024

@AlexAxthelm there are benchmark inputs that are scraped on the fly using the pacta.data.scraping package, e.g. https://github.com/RMI-PACTA/pacta.data.scraping/blob/main/R/get_ishares_index_data.R

I guess the hash would also need to depend on the result of that scraping to be complete?

@cjyetman
Member

cjyetman commented Jul 2, 2024

Conceptually, makes sense to me, though...

  • I don't think the "files from this repo" matter, if I understand that correctly... either the prepared benchmark files ("benchmark_inputs files"? the benchmark portfolios that this workflow outputs?) are the same or not
  • when / how often would the situation happen that the same TM Docker image is being used with the same data and the same benchmark portfolios? possibly on a re-run of actions on a PR when no changes have been made?
  • this should be handled in workflow.transition.monitor in the Azure scripts no? since this repo never has any idea about what pacta-data or Docker image is being used there

Scratch all that... I forgot how this repo is being used. I'll have to think about that once my memory has improved.

@AlexAxthelm
Contributor Author

  • when / how often would the situation happen that the same TM Docker image is being used with the same data and the same benchmark portfolios? possibly on a re-run of actions on a PR when no changes have been made?

This happens a lot. We push a lot of changes to workflow.transition.monitor that don't change any of the processing code, or that don't require a rebuild of the docker image (or rather, the entire image can be rebuilt from cache).
