This is a personal project that I started with the goal of studying and using some of the most popular stream-processing technologies.
I am not interested in testing the scalability, robustness or any other infrastructure-related aspect of these tools, nor in their advanced features. The purpose here is to understand how one would code a simple stream-processing pipeline (something not too far from a Hello World for streaming pipelines).
Using the Python library `faker`, we create fake transaction data to simulate a generic e-commerce. Our main goal is to aggregate those transactions in 3-second windows to know our income in real time (as we shall see below).
The data is produced by the `fake_data.py` script. Note that we hard-coded a list of 10 `user_id`s. This was done to help debugging and to ensure the correctness of our stream pipelines: each `user_id` starts with a digit from 0 to 9, and each user pays a fixed amount equal to ten times its leading digit (with `0` being interpreted as `10`). For example, the user whose id starts with `5` will always pay an amount of `50`.
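
For illustration, a minimal sketch of that generator could look like the following; the field names and the id format are assumptions for this example, not necessarily what `fake_data.py` actually does:

```python
# Hypothetical sketch of the data generator (the real logic lives in fake_data.py).
import json
import random
from datetime import datetime, timezone

from faker import Faker

fake = Faker()

# Hard-coded user ids: each one starts with a different digit from 0 to 9.
USER_IDS = [f"{i}{fake.uuid4()[1:]}" for i in range(10)]


def fake_transaction() -> dict:
    """Build one fake transaction for a generic e-commerce."""
    user_id = random.choice(USER_IDS)
    leading_digit = int(user_id[0]) or 10  # 0 is interpreted as 10
    return {
        "user_id": user_id,
        "amount": 10 * leading_digit,  # each user always pays ten times its leading digit
        "item": fake.word(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    print(json.dumps(fake_transaction()))
```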
We then use Apache Flink, kSQL and Apache Spark to process the data. Our goal is to group all the paid transactions into 3-second windows and calculate (as sketched below):
- the sum of what was paid;
- the number of transactions;
- the average transaction amount.
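
As an illustration of that aggregation, here is a minimal sketch using Spark Structured Streaming; the schema, column names and broker address are assumptions for this example, not the exact job used in this repo:

```python
# Hypothetical sketch of the 3-second window aggregation (not the actual Spark job in this repo).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("transactions-aggregate").getOrCreate()

# Assumed shape of the messages produced by fake_data.py (ISO-8601 timestamps).
schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", TimestampType()),
])

transactions = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "transactions")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("tx"))
    .select("tx.*")
)

# Group the transactions into 3-second tumbling windows.
aggregated = (
    transactions
    .withWatermark("timestamp", "3 seconds")
    .groupBy(F.window("timestamp", "3 seconds"))
    .agg(
        F.sum("amount").alias("total_paid"),
        F.count("*").alias("transactions_count"),
        F.avg("amount").alias("average_paid"),
    )
)

result = aggregated.select(
    F.col("window.start").alias("window_start"),
    F.col("window.end").alias("window_end"),
    "total_paid",
    "transactions_count",
    "average_paid",
)

# Write the aggregates back to Kafka (topic name follows the pattern described below).
query = (
    result
    .select(F.to_json(F.struct(*result.columns)).alias("value"))
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "transactions_aggregate_spark")
    .option("checkpointLocation", "/tmp/spark-checkpoint")
    .outputMode("update")
    .start()
)
query.awaitTermination()
```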
In order to run the project, just run the command `make project`. This will start all the docker containers and generate (almost) all the necessary resources in each container.
Since Flink doesn't have a REST API to programmatically define the stream transformations, one must run `make flink-sql` to start a container with the Flink SQL Client. Once in there, just copy & paste all the SQL instructions located in the `flink` folder (in the same order as the files are numbered). By the end of it, the CLI should report that a Job has been submitted to Flink:
You can go to localhost:8081 and you should see the Job running:
With all that done, run `make data N={{ number }}` to send `{{ number }}` messages to Kafka. The script will send the `{{ number }}` messages using 10 parallel processes.
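
Conceptually, the sending step could be sketched like this (a hypothetical example using `kafka-python` and `multiprocessing`; the actual `fake_data.py` script may differ):

```python
# Hypothetical sketch of sending messages to Kafka in parallel
# (names, library and broker address are assumptions for this example).
import json
from multiprocessing import Pool

from kafka import KafkaProducer  # kafka-python

N_PROCESSES = 10
TOPIC = "transactions"


def send_batch(n_messages: int) -> None:
    """Send a batch of fake transactions to the `transactions` topic."""
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for _ in range(n_messages):
        producer.send(TOPIC, fake_transaction())  # fake_transaction() as sketched above
    producer.flush()


def send(total: int) -> None:
    """Split `total` messages across 10 parallel processes."""
    per_process = total // N_PROCESSES
    with Pool(N_PROCESSES) as pool:
        pool.map(send_batch, [per_process] * N_PROCESSES)


if __name__ == "__main__":
    send(1000)
```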
You can visualize the messages sent using the Kafka UI available at localhost:8080. The messages are sent to the `transactions` topic. The aggregations are stored in the `transactions_aggregate_{{ technology }}` topic.
Then, you can inspect the tables with the results coming from kSQL in Pinot, using the `pinot-controller` interface at localhost:9000.
If needed, more commands are available in the `Makefile`.
- In case it is needed, the `ksql` CLI is available by running `make ksql-cli`.
- We need to send a key to the Kafka topic in order to use a `table` in `kSQL`. See Topic #6 in https://www.confluent.io/blog/troubleshooting-ksql-part-1/.
- The `ROWTIME` pseudo-column available in `kSQL`: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/create-stream/#rowtime

Apache Pinot does an incredible job of compacting the Kafka records that are "wrong" (e.g. without the complete calculations from the window, example below) coming from `kSQL` (and it is fair to imagine that it would do the same job for the data coming from `Spark`). One could imagine what the performance gains (or losses) would be if we set Kafka's `log.cleanup.policy` to `compact` instead of the default `delete`. Obviously this approach would require defining a key for each message, but a natural candidate for this would be the `window_start` (or a hash of it).
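
For illustration, such a "wrong" intermediate record and the final record for the same window might look like this (hypothetical values and field names, not actual output from this repo):

```python
# Hypothetical illustration of an intermediate aggregate emitted by kSQL before the
# 3-second window is complete, followed by the final record for the same window.
# With log.cleanup.policy=compact and window_start as the message key,
# Kafka would eventually keep only the last record per window.
partial_record = {
    "window_start": "2024-01-01T12:00:00.000Z",
    "total_paid": 50,        # only one transaction seen so far
    "transactions_count": 1,
    "average_paid": 50.0,
}

final_record = {
    "window_start": "2024-01-01T12:00:00.000Z",  # same window, hence same key
    "total_paid": 180,       # all transactions of the window accounted for
    "transactions_count": 4,
    "average_paid": 45.0,
}
```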
We deliberately chose not to use the default Docker images of said technologies whenever possible. Since each one can be downloaded and run locally, we chose to do (almost) the same thing using Docker containers. This allowed us to learn more about the configurations available in each one (as one can see in the `config` folder).
- https://www.confluent.io/blog/real-time-analytics-with-kafka-and-pinot/
- https://docs.pinot.apache.org/basics/getting-started/running-pinot-in-docker
- https://www.uber.com/en-BR/blog/real-time-exactly-once-ad-event-processing/
- https://itnext.io/exploring-popular-open-source-stream-processing-technologies-part-1-of-2-31069337ba0e & https://itnext.io/exploring-popular-open-source-stream-processing-technologies-part-2-of-2-2832b7727cd0