data-lake-demo

Data lake demo using change data capture (CDC) on AWS.

Architecture

[Architecture diagram]

  1. Employing the transactional outbox pattern, the source database publishes change event records to a CDC event table. The event records are generated by triggers that listen for insert and update events on the source tables (a minimal trigger sketch is shown after this list).
  2. CDC is implemented in a streaming environment, and Amazon MSK is used to build the streaming infrastructure. To process the CDC event records in real time, source and sink connectors are set up in Amazon MSK Connect: the Debezium connector for PostgreSQL is used as the source connector, and the Lenses S3 connector is used as the sink connector, which pushes the messages to an S3 bucket.
  3. Hudi DeltaStreamer runs on Amazon EMR as a Spark application; it reads the files from the S3 bucket and upserts Hudi records into another S3 bucket. The Hudi table is created in the AWS Glue Data Catalog.
  4. Because the Hudi table is registered in the AWS Glue Data Catalog, it can be queried in Amazon Athena (an example query is sketched below).
  5. Dashboards are created in Amazon QuickSight, with the dataset built on Amazon Athena.
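
As a rough sketch of the outbox trigger in step 1, a PL/pgSQL trigger function can append a row to the CDC event table on every insert or update. The table and column names below are hypothetical and not taken from this repository; the function requires PostgreSQL 11+ for the EXECUTE FUNCTION syntax.

```sql
-- Hypothetical outbox sketch: append a change event row whenever a
-- source table receives an insert or update.
CREATE TABLE IF NOT EXISTS cdc_events (
    id         BIGSERIAL PRIMARY KEY,
    table_name TEXT        NOT NULL,
    operation  TEXT        NOT NULL,           -- 'INSERT' or 'UPDATE'
    payload    JSONB       NOT NULL,           -- full row image as JSON
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION fn_insert_cdc_event()
RETURNS TRIGGER AS $$
BEGIN
    -- TG_TABLE_NAME and TG_OP identify the source table and operation.
    INSERT INTO cdc_events (table_name, operation, payload)
    VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Attach the trigger to a source table (here a hypothetical orders table).
CREATE TRIGGER trg_orders_cdc
    AFTER INSERT OR UPDATE ON orders
    FOR EACH ROW
    EXECUTE FUNCTION fn_insert_cdc_event();
```

The Debezium source connector then only needs to watch the single cdc_events table instead of every source table.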
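To illustrate step 4, a query like the one below could be run in Amazon Athena against the Hudi table registered in the Glue Data Catalog. The database, table, and business column names are hypothetical; _hoodie_commit_time is one of the metadata columns Hudi adds to every record.

```sql
-- Hypothetical Athena query against the Hudi table in the Glue Data Catalog.
SELECT order_id,
       customer_id,
       order_status,
       _hoodie_commit_time           -- Hudi metadata column (latest commit)
FROM   datalake.orders_hudi
WHERE  order_status = 'COMPLETED'
ORDER  BY _hoodie_commit_time DESC
LIMIT  10;
```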
