
sdggroup_technical_test_spark_kafka_airflow

This is the deliverable with the material and configuration needed to bring up the complete platform with docker-compose. No external tools are required.

Environment preparation steps

Build

The images have to be built (rather than pulled) because I had to extend the Airflow base image to include Java, which is required by the spark-submit command. Other images, such as the HDFS ones, have also been extended and configured.

docker-compose up --build

Upload person_inputs.json data (required)

Once all the services are up (check with docker-compose ps), or at least the Hadoop FS ones (namenode and datanode), run the script at the root of the project:

./upload_person_input.sh

That script removes the old data and re-uploads the file to HDFS.
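For reference, here is a minimal Python sketch of what the script does, using the hdfs (HdfsCLI) library. The WebHDFS URL, user, and target path are assumptions, so adjust them to match the compose file; the shipped script presumably achieves the same with hdfs dfs commands.

```python
# Sketch of the upload step, assuming WebHDFS is reachable on the namenode.
from hdfs import InsecureClient

client = InsecureClient("http://localhost:9870", user="root")  # assumed URL/user

HDFS_PATH = "/input/person_inputs.json"  # hypothetical target path

# Clean the old data (returns False if nothing was there) ...
client.delete(HDFS_PATH, recursive=True)

# ... and re-upload the fresh copy from input_files/.
client.upload(HDFS_PATH, "input_files/person_inputs.json", overwrite=True)
```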

Graph generator

To use the graph generator, place the input files person_inputs.json (the data) and sdg_template.json (the metadata) inside the input_files directory.
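To make the metadata input concrete, here is a hypothetical shape for sdg_template.json; every field name below is an assumption, not copied from the repository. It is shown as a Python dict so the same snippet can also write the file.

```python
# Hypothetical sdg_template.json contents -- all field names are assumptions.
import json

sdg_template = {
    "dag_name": "person_pipeline",  # becomes the generated DAG file's name
    "schedule": "@daily",
    "hdfs_input": "hdfs://namenode:9000/input/person_inputs.json",
    "kafka_topic": "person-output",
}

with open("input_files/sdg_template.json", "w") as fh:
    json.dump(sdg_template, fh, indent=2)
```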

The generator

The graph generator is the file graph_generator.py, and it requires dag_template.py, which contains the common transformation operations (@task TaskFlow tasks). The generator takes the required fields from the template and, combined with the content of dag_template.py, writes a new file in the airflow/dags/ folder with the dynamic name provided in sdg_template.json.
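The generation step can be pictured as a small template-filling routine. The sketch below is an assumption about how graph_generator.py works, not a copy of it: the placeholder convention and field names are hypothetical.

```python
# Minimal sketch of the generator: read the metadata, fill the DAG source,
# and write the result where the Airflow scheduler picks it up.
import json
from pathlib import Path

template = json.loads(Path("input_files/sdg_template.json").read_text())
dag_code = Path("dag_template.py").read_text()

# Substitute template-driven values (placeholder names are hypothetical).
dag_code = dag_code.replace("{{dag_name}}", template["dag_name"])
dag_code = dag_code.replace("{{schedule}}", template.get("schedule", "@once"))

# The output file name is dynamic, taken from sdg_template.json.
out_path = Path("airflow/dags") / f"{template['dag_name']}.py"
out_path.write_text(dag_code)
```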

At Airflow runtime, the generated file requires the spark_template.py file located in the airflow/include/ folder.
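As a rough picture of what that Spark job does, here is a sketch of a batch job that reads person_inputs.json from HDFS and publishes each row to Kafka. Hostnames, ports, and the topic name are assumptions; the actual spark_template.py may differ, and writing to Kafka requires the spark-sql-kafka package on the spark-submit classpath.

```python
# Sketch of a spark_template.py-style job (paths and endpoints are assumed).
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.appName("person_pipeline").getOrCreate()

# Read the uploaded input from HDFS.
df = spark.read.json("hdfs://namenode:9000/input/person_inputs.json")

# Serialize each row as a JSON string and write the batch to a Kafka topic.
(df.select(to_json(struct("*")).alias("value"))
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "kafka:9092")
   .option("topic", "person-output")
   .save())
```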

Screenshots

Docker containers running

Airflow autogenerated DAG

Airflow autogenerated DAG detail 1

Airflow autogenerated DAG detail 2

Spark UI after task finish

Kafka before the message

Kafka after the message

Cleanup (beware!)

Remove all images (this affects every image on the host, not just this project's):

docker rmi -f $(docker images -a -q)

Purge stopped containers, dangling images, and unused volumes:

docker container prune -f && docker image prune -f && docker volume prune -f
