Batch ETL pipeline to mirror ENCODE data into the Terra Data Repository (TDR). See the architecture documentation for further design details.
Orchestration of the ETL flows in this project is implemented using Argo Workflows.
The core extraction and transformation data pipelines are implemented in Scio on top of Apache Beam.
After cloning the repository, ensure you can compile the code, auto-generate schema classes and run the test suite from the repository root:
sbt test
All development should be done on branches off of the protected master branch. After review, merge to master and then follow the instructions in the monster-deploy repo.
When modifying the Scio data pipelines, it's possible to run the pipeline locally by invoking the relevant pipeline:
- Extraction:
sbt "encode-extraction / runMain org.broadinstitute.monster.encode.extraction.ExtractionPipeline --outputDir=<some local directory>"
- Transformation:
sbt "encode-transformation-pipeline / runMain org.broadinstitute.monster.encode.transformation.TransformationPipeline --inputPrefix=<extraction dir> --outputPrefix=<output dir>"
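The two invocations above can be chained into a single local run. The following is a minimal sketch, not part of the project: the helper names (`extraction_cmd`, `transformation_cmd`, `run_local`), the `WORK_DIR` variable, and the directory layout are all hypothetical, and creating the directories up front is only a precaution. The sbt project names and main classes are copied from the commands above.

```shell
# Hypothetical scratch locations for a local run; adjust as needed.
WORK_DIR="${WORK_DIR:-/tmp/encode-local}"
EXTRACT_DIR="$WORK_DIR/extracted"
TRANSFORM_DIR="$WORK_DIR/transformed"

# Print the sbt command string for the extraction pipeline.
# $1: local directory the extraction output is written to.
extraction_cmd() {
  printf '%s' "encode-extraction / runMain org.broadinstitute.monster.encode.extraction.ExtractionPipeline --outputDir=$1"
}

# Print the sbt command string for the transformation pipeline.
# $1: directory holding extraction output, $2: directory for transformed output.
transformation_cmd() {
  printf '%s' "encode-transformation-pipeline / runMain org.broadinstitute.monster.encode.transformation.TransformationPipeline --inputPrefix=$1 --outputPrefix=$2"
}

# Run extraction, then transformation on its output.
# Stops at the first step that fails. Must be called from the repository root.
run_local() {
  mkdir -p "$EXTRACT_DIR" "$TRANSFORM_DIR" &&
    sbt "$(extraction_cmd "$EXTRACT_DIR")" &&
    sbt "$(transformation_cmd "$EXTRACT_DIR" "$TRANSFORM_DIR")"
}
```

After sourcing these definitions, calling `run_local` from the repository root runs both pipelines in sequence. Keeping the command strings in separate functions makes it easy to print and inspect them before anything is executed.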
Developing changes to the Argo workflows requires deploying to the DEV environment, as documented in the monster-deploy repo.
- If so, clone this repo: https://github.com/DataBiosphere/ingest-utils.git
- Check out the branch ah_m1arch
1. Before you commit changes, make sure the build completes without errors:
   sbt compile
   and that the tests pass:
   sbt test
   The build may reformat some of your files. Run a diff and add any resulting changes to your git staging area.
2. Commit your changes to your local branch:
   git commit -m "<comment>"
3. Create a version tag for the branch:
   git tag v1.0.<new-number>
   For example, v1.0.120. To find the previous version number, run
   git log
   or check the Actions tab of the repo (https://github.com/DataBiosphere/encode-ingest/actions) to see the last version built. The version format must be exact: there is no "." between the "v" and the "1".
4. Push the branch and tag to the remote:
   git push origin : v1.0.<x>
   Alternatively, push your branch to the remote and then push the tags separately:
   git push --tags
   This will start 2 build actions. If a build fails, check its details for errors or run sbt compile locally. In some cases you may have forgotten to commit formatting changes made automatically by sbt compile. If so, add them to your staging area and start again from #2 above.
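Because the tagging step depends on an exact version format, a small helper can reduce the chance of a malformed tag. The following is a minimal sketch under the assumption that every release tag has the form v1.0.<number> with no leading zeros. The function names (`is_valid_tag`, `next_tag`, `latest_tag`) are hypothetical and not part of this repository.

```shell
# Succeed only if $1 has the exact form v1.0.<digits>.
is_valid_tag() {
  printf '%s\n' "$1" | grep -Eq '^v1\.0\.[0-9]+$'
}

# Print the tag that follows $1, e.g. v1.0.120 -> v1.0.121.
# Fails with a message on stderr if $1 is not a valid tag.
next_tag() {
  is_valid_tag "$1" || { echo "unexpected tag format: $1" >&2; return 1; }
  n="${1##*.}"              # keep only the text after the last "."
  echo "v1.0.$((n + 1))"
}

# Print the highest v1.0.<number> tag in the current git repository.
# Sorts numerically on the third dot-separated field.
latest_tag() {
  git tag --list 'v1.0.*' | grep -E '^v1\.0\.[0-9]+$' | sort -t. -k3,3n | tail -n 1
}
```

From inside the repository, `git tag "$(next_tag "$(latest_tag)")"` would then create the next tag in sequence, provided at least one matching tag already exists locally. It is still worth confirming the result against the last version shown on the Actions tab before pushing.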