This repo is just a template that consumes Kafka with Spark Structured Streaming and emulates writing to AWS S3 with Delta Lake.
[AWS]
* AWS S3: the bucket that stores the data consumed from Kafka in Delta Lake format.
* AWS SSM: stores the username and password for the Schema Registry (see the sketch after this list).
* AWS EMR: emulates the Spark Structured Streaming job running on AWS EMR.
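A minimal sketch, assuming the credentials are fetched from SSM Parameter Store with boto3; the parameter names (`/schema_registry/username`, `/schema_registry/password`) are placeholders, not the names this template actually uses:

```python
import boto3


def get_schema_registry_credentials(region="us-east-1"):
    """Fetch Schema Registry credentials from SSM Parameter Store.

    The parameter names below are hypothetical; replace them with the
    paths your environment actually uses.
    """
    ssm = boto3.client("ssm", region_name=region)
    username = ssm.get_parameter(
        Name="/schema_registry/username", WithDecryption=True
    )["Parameter"]["Value"]
    password = ssm.get_parameter(
        Name="/schema_registry/password", WithDecryption=True
    )["Parameter"]["Value"]
    return username, password
```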
[Other Services]
* Kafka: can be self-hosted or AWS MSK.
* Schema Registry: a self-hosted Schema Registry service that assists with schema validation.
```
|-- project_libs/
    |-- __init__.py
    |-- common/
        |-- __init__.py
        |-- loadconfig.py
        |-- utils.py
    |-- project_configs/
        |-- general_config.yaml
        |-- topics_config.yaml
        |-- consumer_config.yaml
    |-- helpers/
        |-- __init__.py
        |-- schema_registry.py
|-- src/
    |-- spark_consumer.py
|-- test/
    |-- submit_steps.py
|-- spark_bootstrap.sh
|-- setup.py
|-- README.md
```
project_libs/ folder:
- Includes general lib scripts; loadconfig.py loads the YAML config files, and utils.py provides utility functions (a minimal loading sketch is shown below).
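A minimal sketch of what loadconfig.py might do, assuming it just resolves a file under project_configs/ and parses it with PyYAML; the function name `load_config` and the directory constant are assumptions, not the template's actual interface:

```python
import os

import yaml  # PyYAML

# Hypothetical location of the config directory relative to this module.
CONFIG_DIR = os.path.join(os.path.dirname(__file__), "..", "project_configs")


def load_config(filename):
    """Load a YAML config file (e.g. 'general_config.yaml') into a dict."""
    path = os.path.join(CONFIG_DIR, filename)
    with open(path, "r") as f:
        return yaml.safe_load(f)
```

Usage would then look like `general = load_config("general_config.yaml")`.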
project_configs/ folder:
- general_config.yaml: general config YAML file.
- topics_config.yaml: topic names by environment, just for testing and emulation.
- consumer_config.yaml: config settings for consuming data from Kafka.
helpers/ folder:
- schema_registry.py: Schema Registry client (see the sketch below).
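A rough sketch of the kind of client schema_registry.py could be, assuming it talks to the Confluent Schema Registry REST API with the credentials stored in SSM; the class name and method are illustrative only:

```python
import requests


class SchemaRegistryClient:
    """Minimal Schema Registry REST client (illustrative only)."""

    def __init__(self, url, username, password):
        self.url = url.rstrip("/")
        self.auth = (username, password)

    def get_latest_schema(self, subject):
        """Fetch the latest registered schema for a subject,
        e.g. '<topic>-value' for value schemas."""
        resp = requests.get(
            f"{self.url}/subjects/{subject}/versions/latest",
            auth=self.auth,
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()  # contains 'schema', 'id', 'version', ...
```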
src/ folder:
- spark_consumer.py: the main script that consumes from Kafka (see the sketch below).
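A minimal, hedged sketch of the kind of job spark_consumer.py runs: read from Kafka with Structured Streaming and write to an S3 path in Delta format. The bootstrap servers, topic, bucket, and checkpoint paths below are placeholders written in the same {} template style as the config files, and the cluster is assumed to have Delta Lake available:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("kafka-to-delta")
    .getOrCreate()
)

# Read the raw Kafka stream; options mirror the consumer_config.yaml placeholders.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "{bootstrap_servers}")
    .option("subscribe", "{topic_name}")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast to string for this sketch
# (real code would deserialize using the Schema Registry schema).
events = raw.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("topic"),
    col("timestamp"),
)

# Write to S3 as a Delta table; requires Delta Lake on the cluster.
query = (
    events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://{bucket}/checkpoints/{topic_name}/")
    .start("s3://{bucket}/delta/{topic_name}/")
)

query.awaitTermination()
```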
test/ folder:
- submit_steps.py: used to add a step to an AWS EMR cluster with boto3 (see the sketch below).
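A hedged sketch of how submit_steps.py might add a Spark step with boto3; the cluster ID, S3 paths, and artifact names are assumptions:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical step: run spark_consumer.py via spark-submit, shipping the
# project egg built by setup.py with --py-files.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder EMR cluster ID
    Steps=[
        {
            "Name": "spark-consumer",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--py-files", "s3://{bucket}/artifacts/project_libs.egg",
                    "s3://{bucket}/artifacts/spark_consumer.py",
                ],
            },
        }
    ],
)
print(response["StepIds"])
```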
spark_bootstrap.sh: the bootstrap script used when provisioning the EMR cluster.
setup.py: used to build the project as an egg package that can be shipped with EMR steps (see the sketch below).
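A minimal setup.py sketch, assuming the packaged code lives under project_libs/; the metadata values are placeholders:

```python
from setuptools import find_packages, setup

setup(
    name="project_libs",          # placeholder package name
    version="0.1.0",              # placeholder version
    packages=find_packages(include=["project_libs", "project_libs.*"]),
    # Bundle the YAML configs so they resolve inside the egg.
    package_data={"project_libs": ["project_configs/*.yaml"]},
    install_requires=["pyyaml"],  # assumption: PyYAML for loadconfig.py
)
```

Building with `python setup.py bdist_egg` drops the egg under dist/, ready to upload to S3 for the EMR step's `--py-files`.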
P.S. All strings containing {} in the config files are just templates. When you use this repo, replace them with the proper values for your environment (see the sketch below).
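One hedged way to fill those placeholders at runtime instead of editing the files by hand, using Python's `str.format` on the loaded values; the key and values shown are assumptions:

```python
# Assume load_config() returned a templated value like this one.
config = {"delta_path": "s3://{bucket}/delta/{topic_name}/"}

# Substitute environment-specific values before using the config.
delta_path = config["delta_path"].format(
    bucket="my-data-bucket",   # placeholder value
    topic_name="orders",       # placeholder value
)
print(delta_path)  # s3://my-data-bucket/delta/orders/
```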