
Update github action to check if outputs already exist #103

Open
AlexAxthelm opened this issue Jun 19, 2024 · 3 comments
@AlexAxthelm
Contributor

Running index prep is a long process that's part of the current build process for workflow.transition.monitor, and will probably be for workflow.pacta.webapp as well.

Given that the indices don't actually change that much, it would make sense to check whether the process actually needs to run, or whether we can bypass it and return the previous results instead.

My general thinking is to construct a hash based on:

  • all of the pacta-data files,
  • files from this repo,
  • benchmark_inputs files, and
  • the docker image (SHA),

and use that as a versioning key; then we can check whether the appropriate files already exist in the blob store/AFS.
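For concreteness, here's a minimal sketch of one way the "some_magic_function" below could work, hashing every file in the input directories plus the image SHA. The function name, the directory arguments, and the `DOCKER_IMAGE_SHA` variable are all placeholders of mine, not anything this repo defines:

```shell
# Sketch: hash file contents of the given directories plus the docker
# image SHA into one versioning key. Assumes GNU coreutils (sort -z,
# sha256sum), which the GitHub-hosted Linux runners have.
compute_input_hash() {
  # $@ = directories to include; DOCKER_IMAGE_SHA set by an earlier step
  {
    # hash every file, sorted by path so the result is deterministic
    find "$@" -type f -print0 | sort -z | xargs -0 sha256sum
    echo "${DOCKER_IMAGE_SHA}"
  } | sha256sum | cut -d' ' -f1
}

# usage inside the workflow step (paths are illustrative):
# hash=$(compute_input_hash pacta-data benchmark_inputs .)
```

Sorting before the outer hash matters: `find` returns files in filesystem order, which isn't stable across machines.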

so it would look something like:

# Download all the files from AZ
# Pull the Base image

- id: hash
  run: |
    hash=some_magic_function(all those inputs)
    echo "hash=$hash" >> "$GITHUB_OUTPUT"

- id: check
  name: check if file exists on AZ
  run: |
    response=$(az storage blob exists --blob-url "$CONTAINER_URL/${{ steps.hash.outputs.hash }}/foo.rds")
    parsed_response=$(echo "$response" | jq '.some_filter')
    echo "exists=$parsed_response" >> "$GITHUB_OUTPUT"

- if: ${{ steps.check.outputs.exists == 'true' }}
  # early return of the Blob URL

- if: ${{ steps.check.outputs.exists != 'true' }}
  # run the rest of the process normally
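As an aside on the parsing step: if I remember right, `az storage blob exists` prints JSON along the lines of `{"exists": true}`, so the `jq` filter would just pull out that boolean. A quick local sketch, with the CLI response simulated by a literal string so it's easy to try without Azure credentials:

```shell
# Simulated `az storage blob exists` response (format is my assumption):
response='{"exists": true}'

# Extract the boolean with jq; -r gives the bare value without quotes.
exists=$(printf '%s' "$response" | jq -r '.exists')
echo "exists=$exists"
```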

cc @jdhoffa @cjyetman: do I have the hash keys right, or am I missing something?

@AlexAxthelm AlexAxthelm self-assigned this Jun 19, 2024
@jdhoffa
Member

jdhoffa commented Jul 2, 2024

@AlexAxthelm there are benchmark inputs that are scraped on the fly using the pacta.data.scraping package, e.g. https://github.com/RMI-PACTA/pacta.data.scraping/blob/main/R/get_ishares_index_data.R

I guess the hash would also need to depend on the result of that scraping to be complete?

@cjyetman
Member

cjyetman commented Jul 2, 2024

Conceptually, makes sense to me, though...

  • I don't think the "files from this repo" matter, if I understand that correctly... either the prepared benchmark files ("benchmark_inputs files"? the benchmark portfolios that this workflow outputs?) are the same or not
  • when / how often would the situation happen that the same TM Docker image is being used with the same data and the same benchmark portfolios? possibly on a re-run of actions on a PR when no changes have been made?
  • this should be handled in workflow.transition.monitor in the Azure scripts no? since this repo never has any idea about what pacta-data or Docker image is being used there

Scratch all that... I forgot how this repo is being used. I'll have to think about that once my memory has improved.

@AlexAxthelm
Contributor Author

  • when / how often would the situation happen that the same TM Docker image is being used with the same data and the same benchmark portfolios? possibly on a re-run of actions on a PR when no changes have been made?

This happens a lot. We push a lot of changes to workflow.transition.monitor that don't change any of the processing code, or that don't require a rebuild of the docker image (or rather, the entire image can be rebuilt from cache).
