
HCA Ingest

Batch ETL workflow for ingesting HCA data into the Terra Data Repository (TDR). See the architecture doc for more system design information.

Getting Started

  • Choose whether you will develop in a local virtual environment, like venv, or use the Docker Compose dev env provided here. (The commands for both paths are consolidated in the sketches after this list.)
  • Clone this repository to a local directory and create a new dev branch, ideally named after your Jira ticket. (Note that a Jira ticket is required in order to make changes to production.)
    • Set up Git Secrets
    • If you are using a local virtual environment, set it up now. Note that this project uses Python 3.9.16.
    • If you are using the provided Docker Compose dev env, use the following command to invoke it: docker compose run -w /hca-ingest app bash
      • Note that if you are not already logged in to gcloud, you will need to do so before running
        the docker compose command, as it will pull the latest image from Artifact Registry.
  • Authenticate with gcloud using your Broad credentials: gcloud auth login
  • Then set your billing project: gcloud config set project PROJECT_ID
    • For prod this is mystical-slate-284720
  • Then set up your application default credentials: gcloud auth application-default login
  • Build and run the dataflow tests
    • From the repository/image root: sbt test
      • If this fails, you may need to get a clean clone of the repository
      • Note that this will take a while to run, as it will build the project and run all tests
    • Make sure you have poetry installed (already done in Docker image)
    • From orchestration/:
      • Run poetry install to set up a local python virtual environment and install needed dependencies
        • If you've updated pyproject.toml you'll need to update the lock file first.
          Prefer poetry lock --no-update when updating the lock file, to avoid bumping dependencies unintentionally.
        • Note that the first time you run this in your environment it can take up to 10 hours to complete, due to the large number of dependencies.
          • This is not true for the Docker image, which makes use of the poetry cache.
      • Run pytest and make sure all tests except our end-to-end suite run and pass locally
        • If you installed pytest via poetry, you will need to run poetry run pytest instead.
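
The gcloud and dev-env steps above can be run back to back. A minimal sketch, using the prod billing project quoted above (substitute your own project ID for other environments):

```bash
# Authenticate and configure gcloud with your Broad credentials
gcloud auth login
gcloud config set project mystical-slate-284720   # prod billing project from this README
gcloud auth application-default login

# If you chose the Docker Compose dev env instead of a local venv
# (requires the gcloud login above, since the latest image is pulled from Artifact Registry):
docker compose run -w /hca-ingest app bash
```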
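
Likewise, the test commands above collected in one place, assuming you start at the repository (or image) root:

```bash
# Dataflow (Scala) tests -- builds the project and runs the full suite, so this takes a while
sbt test

# Orchestration (Python) tests
cd orchestration
poetry install            # set up the local virtualenv and install dependencies
poetry run pytest         # everything except the end-to-end suite should pass locally
```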

Development Process

All code should first be developed on a branch off of main. Once it is ready for review,
submit a PR against main, tag the broad-data-ingest-admin team as reviewers, and ensure all checks pass.
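
For reference, a typical branch-and-PR flow might look like the sketch below. The ticket ID is a hypothetical placeholder, and the gh command assumes the GitHub CLI is installed and that the reviewer team slug matches the team named above:

```bash
git checkout main && git pull
git checkout -b DI-1234-my-change          # hypothetical Jira ticket ID as the branch name
# ...make and commit your changes...
git push -u origin DI-1234-my-change
gh pr create --base main --reviewer DataBiosphere/broad-data-ingest-admin
```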

The Docker image at the top of the repository is automatically built and pushed to Artifact Registry
each time you push to dev or merge to main;
see .github/workflows/build_and_publish_dev.yaml and .github/workflows/build_and_publish_main.yaml.
To build manually, use update_docker_image.sh: first update the version field, then run the script.
This will build the image, tag it with the version, and push it to Artifact Registry.
Note that establishing the connection to Artifact Registry may take a while, so be patient.
If the push is slow, you may be on the split VPN and/or pushing over IPv6; either turn off the VPN or turn off IPv6
on your router to speed this up.
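
If you do build manually, the flow is roughly the sketch below. Where the version field lives is not specified here, so treat that step (and the registry host in the configure-docker command) as assumptions:

```bash
# 1. Update the version field used by update_docker_image.sh (location assumed)
# 2. Make sure Docker can push to Artifact Registry (registry host is an assumption)
gcloud auth configure-docker us-east4-docker.pkg.dev
# 3. Build the image, tag it with the version, and push it
./update_docker_image.sh
```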

Once the PR is approved and merged, the end-to-end test suite will be run. Once this passes, the dataflow
and orchestration code will be packaged into Docker images for consumption by Dataflow and Dagster, respectively.

See the deployment doc for next steps on getting code to dev and production.