This is a personal project that I started with the goal of studying and using some of the most popular stream-processing technologies.
I am not interested in testing the scalability, robustness or any other infrastructure-related aspect of these tools, nor in their advanced features. The purpose here is to understand how one would code a simple stream-processing pipeline (something not too far from a Hello World for streaming pipelines).
Using the Python library `faker`, we create fake transaction data to simulate a generic e-commerce. Our main goal is to aggregate those transactions in 3-second windows to know our income in real time (as we shall see below).
The data is produced by the `fake_data.py` script. Note that we hard-coded a list of 10 `user_id`s. This was done to help debugging and to ensure the correctness of our stream pipelines: each `user_id` starts with a digit from 0 to 9, and each user pays a fixed amount equal to ten times its leading digit (with `0` being interpreted as `10`). For example, the user whose id starts with `5` will always pay an amount of `50`.
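
For illustration, a minimal sketch of that generator could look like the following; the field names and the id format are assumptions for this example, not necessarily what `fake_data.py` actually does:

```python
# Hypothetical sketch of the data generator (the real logic lives in fake_data.py).
import json
import random
from datetime import datetime, timezone

from faker import Faker

fake = Faker()

# Hard-coded user ids: each one starts with a different digit from 0 to 9.
USER_IDS = [f"{i}{fake.uuid4()[1:]}" for i in range(10)]


def fake_transaction() -> dict:
    """Build one fake transaction for a generic e-commerce."""
    user_id = random.choice(USER_IDS)
    leading_digit = int(user_id[0]) or 10  # 0 is interpreted as 10
    return {
        "user_id": user_id,
        "amount": 10 * leading_digit,  # each user always pays ten times its leading digit
        "item": fake.word(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    print(json.dumps(fake_transaction()))
```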
We then use Apache Flink, kSQL and Apache Spark to process the data. Our goal is to group all the paid transactions into 3-second windows and calculate (as sketched below):
- the sum of what was paid;
- the number of transactions;
- the average transaction amount.
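
As an illustration of that aggregation, here is a minimal sketch using Spark Structured Streaming; the schema, column names and broker address are assumptions for this example, not the exact job used in this repo:

```python
# Hypothetical sketch of the 3-second window aggregation (not the actual Spark job in this repo).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("transactions-aggregate").getOrCreate()

# Assumed shape of the messages produced by fake_data.py (ISO-8601 timestamps).
schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", TimestampType()),
])

transactions = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "transactions")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("tx"))
    .select("tx.*")
)

# Group the transactions into 3-second tumbling windows.
aggregated = (
    transactions
    .withWatermark("timestamp", "3 seconds")
    .groupBy(F.window("timestamp", "3 seconds"))
    .agg(
        F.sum("amount").alias("total_paid"),
        F.count("*").alias("transactions_count"),
        F.avg("amount").alias("average_paid"),
    )
)

result = aggregated.select(
    F.col("window.start").alias("window_start"),
    F.col("window.end").alias("window_end"),
    "total_paid",
    "transactions_count",
    "average_paid",
)

# Write the aggregates back to Kafka (topic name follows the pattern described below).
query = (
    result
    .select(F.to_json(F.struct(*result.columns)).alias("value"))
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "transactions_aggregate_spark")
    .option("checkpointLocation", "/tmp/spark-checkpoint")
    .outputMode("update")
    .start()
)
query.awaitTermination()
```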
In order to run the project, just run the command `make project`. This will start all the docker containers and generate (almost) all the necessary resources in each container.
Since Flink doesn't have a REST API to programmatically define the stream transformations, one must run `make flink-sql` to start a container with the Flink SQL Client. Once in there, just copy & paste all the SQL instructions located in the `flink` folder (in the same order as the files are numbered). By the end of it, the CLI should report that a Job has been submitted to Flink:
You can go to localhost:8081 and you should see the Job running:
With all that done, run `make data N={{ number }}` to send `{{ number }}` messages to Kafka. The script will send the `{{ number }}` messages using 10 parallel processes.
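
Conceptually, the sending step could be sketched like this (a hypothetical example using `kafka-python` and `multiprocessing`; the actual `fake_data.py` script may differ):

```python
# Hypothetical sketch of sending messages to Kafka in parallel
# (names, library and broker address are assumptions for this example).
import json
from multiprocessing import Pool

from kafka import KafkaProducer  # kafka-python

N_PROCESSES = 10
TOPIC = "transactions"


def send_batch(n_messages: int) -> None:
    """Send a batch of fake transactions to the `transactions` topic."""
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for _ in range(n_messages):
        producer.send(TOPIC, fake_transaction())  # fake_transaction() as sketched above
    producer.flush()


def send(total: int) -> None:
    """Split `total` messages across 10 parallel processes."""
    per_process = total // N_PROCESSES
    with Pool(N_PROCESSES) as pool:
        pool.map(send_batch, [per_process] * N_PROCESSES)


if __name__ == "__main__":
    send(1000)
```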
You can visualize the messages sent using the Kafka UI available at localhost:8080. The messages are sent to the `transactions` topic. The aggregations are stored in the `transactions_aggregate_{{ technology }}` topic.
Then, you can inspect the tables with the results coming from kSQL in Pinot, using the `pinot-controller` interface at localhost:9000.
If needed, more commands are available in the `Makefile`.
- In case it is needed, the `ksql` CLI is available by running `make ksql-cli`.
- We need to send a key to the Kafka topic in order to use a `table` in `kSQL`. See Topic #6 in https://www.confluent.io/blog/troubleshooting-ksql-part-1/.
- The `ROWTIME` pseudo-column available in `kSQL`: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/create-stream/#rowtime

Apache Pinot does an incredible job of compacting the Kafka records that are "wrong" (e.g. without the complete calculations from the window, example below) coming from `kSQL` (and it is fair to imagine that it would do the same job for the data coming from `Spark`). One could imagine what the performance gains (or losses) would be if we set Kafka's `log.cleanup.policy` to `compact` instead of the default `delete`. Obviously this approach would require defining a key for each message, but a natural candidate for this would be the `window_start` (or a hash of it).
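
For illustration, such a "wrong" intermediate record and the final record for the same window might look like this (hypothetical values and field names, not actual output from this repo):

```python
# Hypothetical illustration of an intermediate aggregate emitted by kSQL before the
# 3-second window is complete, followed by the final record for the same window.
# With log.cleanup.policy=compact and window_start as the message key,
# Kafka would eventually keep only the last record per window.
partial_record = {
    "window_start": "2024-01-01T12:00:00.000Z",
    "total_paid": 50,        # only one transaction seen so far
    "transactions_count": 1,
    "average_paid": 50.0,
}

final_record = {
    "window_start": "2024-01-01T12:00:00.000Z",  # same window, hence same key
    "total_paid": 180,       # all transactions of the window accounted for
    "transactions_count": 4,
    "average_paid": 45.0,
}
```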
We deliberately chose not to use the default Docker images of said technologies whenever possible. Since each one can be downloaded and run locally, we chose to do (almost) the same thing using Docker containers. This allowed us to learn more about the configurations available in each one (as one can see in the `config` folder).
- https://www.confluent.io/blog/real-time-analytics-with-kafka-and-pinot/
- https://docs.pinot.apache.org/basics/getting-started/running-pinot-in-docker
- https://www.uber.com/en-BR/blog/real-time-exactly-once-ad-event-processing/
- https://itnext.io/exploring-popular-open-source-stream-processing-technologies-part-1-of-2-31069337ba0e & https://itnext.io/exploring-popular-open-source-stream-processing-technologies-part-2-of-2-2832b7727cd0