This is a common, project-agnostic approach to a data pipeline: it is designed to be reused and extended by other code bases as needed.
The core idea utilises the following:
- Dagster is set up via Docker / Docker Compose (see the Compose sketch after this list). The same pattern could be adapted to another orchestrator such as Airflow.
- This style of deployment keeps things flexible: images can be run locally just as easily as on cloud infrastructure.
- Images built from the Dockerfiles can be deployed to cloud infrastructure or run locally. AWS with Terraform is used as the example here, but the platform does not depend on it.
- AWS implementation notes:
  - Images are pushed to ECR repositories and then run on ECS (see the deployment sketch after this list).
  - Files such as pipeline code are stored on cloud storage (such as S3) and synced into Dagster; a separate pipeline handles this deployment (also covered in the deployment sketch after this list).
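
As a rough illustration of the Docker Compose setup, the sketch below runs dagit, the Dagster daemon, and a Postgres-backed instance. Service names, images, ports, and environment values are placeholders, not the repo's actual configuration; the Postgres env vars are assumed to be wired in via a `dagster.yaml` that is not shown here.

```yaml
# Illustrative docker-compose sketch only; names and values are assumptions.
services:
  dagster_postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: dagster
      POSTGRES_PASSWORD: dagster
      POSTGRES_DB: dagster

  dagit:
    build: .                              # Dockerfile in this folder (assumed)
    entrypoint: ["dagit", "-h", "0.0.0.0", "-p", "3000", "-w", "workspace.yaml"]
    ports:
      - "3000:3000"
    environment:
      DAGSTER_POSTGRES_HOST: dagster_postgres   # read by dagster.yaml (not shown)
    depends_on:
      - dagster_postgres

  dagster_daemon:
    build: .
    entrypoint: ["dagster-daemon", "run"]
    environment:
      DAGSTER_POSTGRES_HOST: dagster_postgres
    depends_on:
      - dagster_postgres
```

Because dagit and the daemon are plain containers, the same images can be brought up locally with `docker compose up` or handed to the cloud runtime unchanged.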
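
The deployment pipeline itself lives outside this repo. As a hedged sketch only (GitHub Actions is assumed here purely for illustration and is not something this README specifies; account IDs, regions, bucket names, and paths are placeholders), a deployment job that pushes the image to ECR and syncs pipeline code to S3 could look like this:

```yaml
# Hypothetical deployment workflow (GitHub Actions assumed; any CI that can run
# the docker and aws CLIs works the same way). All names here are placeholders.
name: deploy-dagster
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-region: eu-west-1                           # placeholder region
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}  # hypothetical secret name

      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr

      # Build the Dagster image from the dagster/ folder and push it to ECR;
      # ECS pulls this tag when the service is redeployed.
      - run: |
          docker build -t "${{ steps.ecr.outputs.registry }}/dagster:latest" ./dagster
          docker push "${{ steps.ecr.outputs.registry }}/dagster:latest"

      # Sync pipeline code to the bucket the running Dagster containers read from.
      - run: aws s3 sync ./pipelines "s3://example-dagster-code/pipelines"   # placeholder bucket/path
```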
Documentation on the infrastructure can be found here
The code in the dagster folder here is used to run the dagit and daemon services on the platform. Code servers can then be hooked in from separate repos as needed (see the workspace sketch below).
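
Hooking in a code server generally means registering a gRPC entry in the workspace file that dagit and the daemon read. A minimal sketch, assuming each external repo runs its own code server (for example via `dagster api grpc`); host, port, and location name are placeholders for whatever that repo exposes:

```yaml
# workspace.yaml sketch: one gRPC entry per external code repo. Values are placeholders.
load_from:
  - grpc_server:
      host: pipelines_code_server      # hostname of the code server container
      port: 4000
      location_name: "example_pipelines"
```

Keeping code locations behind gRPC means pipeline repos can be redeployed independently of the dagit and daemon images in this repo.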