
HCA Ingest

Batch ETL workflow for ingesting HCA data into the Terra Data Repository (TDR). See the architecture doc for more system design information.

Getting Started

  • Choose whether you will develop in a local virtual environment, like venv, or use the Docker Compose dev env provided here. (The commands for both paths are consolidated in the sketches after this list.)
  • Clone this repository to a local directory and create a new dev branch, ideally named after your Jira ticket. (Note that a Jira ticket is required in order to make changes to production.)
    • Set up Git Secrets
    • If you are using a local virtual environment, set it up now. Note that this project uses Python 3.9.16.
    • If you are using the provided Docker Compose dev env, use the following command to invoke it: docker compose run -w /hca-ingest app bash
      • Note that if you are not already logged in to gcloud, you will need to do so before running
        the docker compose command, as it will pull the latest image from Artifact Registry.
  • Authenticate with gcloud using your Broad credentials: gcloud auth login
  • Then set your billing project: gcloud config set project PROJECT_ID
    • For prod this is mystical-slate-284720
  • Then set up your application default credentials: gcloud auth application-default login
  • Build and run the dataflow tests
    • From the repository/image root: sbt test
      • If this fails, you may need to get a clean clone of the repository
      • Note that this will take a while to run, as it will build the project and run all tests
    • Make sure you have poetry installed (already done in Docker image)
    • From orchestration/:
      • Run poetry install to set up a local python virtual environment and install needed dependencies
        • If you've updated pyproject.toml you'll need to update the lock file first.
          Prefer poetry lock --no-update when updating the lock file, to avoid bumping dependencies unintentionally.
        • Note that the first time you run this in your environment it can take up to 10 hours to complete, due to the large number of dependencies.
          • This is not true for the Docker image, which makes use of the poetry cache.
      • Run pytest and make sure all tests except our end-to-end suite run and pass locally
        • If you installed pytest via poetry, you will need to run poetry run pytest instead.
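
The gcloud and dev-env steps above can be run back to back. A minimal sketch, using the prod billing project quoted above (substitute your own project ID for other environments):

```bash
# Authenticate and configure gcloud with your Broad credentials
gcloud auth login
gcloud config set project mystical-slate-284720   # prod billing project from this README
gcloud auth application-default login

# If you chose the Docker Compose dev env instead of a local venv
# (requires the gcloud login above, since the latest image is pulled from Artifact Registry):
docker compose run -w /hca-ingest app bash
```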
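
Likewise, the test commands above collected in one place, assuming you start at the repository (or image) root:

```bash
# Dataflow (Scala) tests -- builds the project and runs the full suite, so this takes a while
sbt test

# Orchestration (Python) tests
cd orchestration
poetry install            # set up the local virtualenv and install dependencies
poetry run pytest         # everything except the end-to-end suite should pass locally
```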

Development Process

All code should first be developed on a branch off of main. Once it is ready for review,
submit a PR against main, tag the broad-data-ingest-admin team as reviewers, and ensure all checks pass.
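
For reference, a typical branch-and-PR flow might look like the sketch below. The ticket ID is a hypothetical placeholder, and the gh command assumes the GitHub CLI is installed and that the reviewer team slug matches the team named above:

```bash
git checkout main && git pull
git checkout -b DI-1234-my-change          # hypothetical Jira ticket ID as the branch name
# ...make and commit your changes...
git push -u origin DI-1234-my-change
gh pr create --base main --reviewer DataBiosphere/broad-data-ingest-admin
```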

The Docker image at the top of the repository is automatically built and pushed to Artifact Registry
each time you push to dev or merge to main;
see .github/workflows/build_and_publish_dev.yaml and .github/workflows/build_and_publish_main.yaml.
To build manually, use update_docker_image.sh: first update the version field, then run the script.
This will build the image, tag it with the version, and push it to Artifact Registry.
Note that establishing the connection to Artifact Registry may take a while, so be patient.
If the push is slow, you may be on the split VPN and/or pushing over IPv6; either turn off the VPN or turn off IPv6
on your router to speed this up.
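
If you do build manually, the flow is roughly the sketch below. Where the version field lives is not specified here, so treat that step (and the registry host in the configure-docker command) as assumptions:

```bash
# 1. Update the version field used by update_docker_image.sh (location assumed)
# 2. Make sure Docker can push to Artifact Registry (registry host is an assumption)
gcloud auth configure-docker us-east4-docker.pkg.dev
# 3. Build the image, tag it with the version, and push it
./update_docker_image.sh
```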

Once the PR is approved and merged, the end-to-end test suite will be run. Once this passes, the dataflow
and orchestration code will be packaged into Docker images for consumption by Dataflow and Dagster, respectively.

See the deployment doc for next steps on getting code to dev and production.