This tool creates a simple cloud system component of a big data processing architecture. The repository includes the basic code and tooling required to readily set up the cloud component of a big data pipeline, enabling developers and researchers to perform additional testing of the code. This project works in tandem with the edge-sim project to quickly create a three-tier architecture.
The overall workflow/architecture can be seen in Figure 1.
The different components available are:
- zookeeper: Service discovery container required by Apache Kafka
- kafka: Scalable message broker (see how to scale the service up/down below)
- database: MongoDB database for stream data ingestion
- stream-processor: Provides streaming analytics by consuming Kafka messages over a fixed window interval
- database-processor: Ingests Kafka messages into the MongoDB database
- spark: The master/controller node of Apache Spark
- spark-worker: Worker nodes that connect to the spark service (see how to scale the service up/down below)
- mongo-express: Tool for connecting to the database service
In addition, Kafka message consumer code is provided in the /Util directory.
- Figure 1: Workflow and architecture of cloud pipeline sub-systems
To run the cloud pipeline service, perform the following steps:
To start Kafka, first run the zookeeper service:
$ docker-compose up -d zookeeper
Next, start the Kafka brokers:
$ docker-compose up --scale kafka=NUMBER_OF_BROKERS
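Once the brokers are up, it can help to verify connectivity from the host before wiring up the consumers. The short sketch below is illustrative only: it assumes the kafka-python package is installed and that BROKER_LIST is replaced with the host:port pairs Docker published for the kafka service (check them with docker-compose ps); the test topic name is arbitrary.

```python
# Minimal broker connectivity check (illustrative sketch, not part of the repo).
# Assumes: pip install kafka-python, and BROKER_LIST replaced with the host:port
# pairs published by docker-compose for the kafka service.
from kafka import KafkaConsumer, KafkaProducer

BROKER_LIST = "192.168.1.12:32812,192.168.1.12:32814"  # example addresses from this README

# List the topics the cluster currently knows about.
consumer = KafkaConsumer(bootstrap_servers=BROKER_LIST.split(","))
print("Topics visible on the cluster:", consumer.topics())
consumer.close()

# Send a throwaway message to confirm the brokers accept writes.
producer = KafkaProducer(bootstrap_servers=BROKER_LIST.split(","))
producer.send("connectivity-test", b"hello from the cloud pipeline")
producer.flush()
producer.close()
```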
To start MongoDB, just run the command:
$ docker-compose up -d database
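Before pointing the ingestor at it, you can confirm the database is reachable with a quick pymongo check. This is a sketch under assumptions: pymongo is installed on the host and the database service maps MongoDB's default port 27017 to localhost; adjust the URI (and credentials, if docker-compose.yml defines any) to your setup.

```python
# Quick MongoDB reachability check (illustrative sketch, not part of the repo).
# Assumes: pip install pymongo, and that the database service publishes the
# default port 27017 on localhost; adjust to match docker-compose.yml.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/", serverSelectionTimeoutMS=5000)
client.admin.command("ping")          # raises if the server is unreachable
print("Databases:", client.list_database_names())
```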
To start the Kafka consumer service, run the following command while Kafka is running:
$ docker-compose up --scale kafka=NUMBER_OF_BROKERS database-processor
Note: The Kafka consumer requires a comma-separated list of Kafka brokers. It has to be provided in the entrypoint config of the docker-compose.yml file.
Example: entrypoint: ["python3", "MongoIngestor.py", "192.168.1.12:32812,192.168.1.12:32814", "kafka-database-consumer-group-2"]
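For orientation, a consumer of this kind typically parses the broker list and consumer-group name from those entrypoint arguments, subscribes to a topic, and writes each record into MongoDB. The sketch below only illustrates that pattern and is not the repository's MongoIngestor.py; the topic, database, and collection names are assumptions.

```python
# Illustrative Kafka-to-MongoDB ingestor, mirroring the entrypoint arguments above.
# This is a sketch of the general pattern, not the repository's MongoIngestor.py;
# the topic, database, and collection names are assumptions.
import json
import sys

from kafka import KafkaConsumer
from pymongo import MongoClient

brokers = sys.argv[1].split(",")   # e.g. "192.168.1.12:32812,192.168.1.12:32814"
group_id = sys.argv[2]             # e.g. "kafka-database-consumer-group-2"

consumer = KafkaConsumer(
    "sensor-data",                 # assumed topic name
    bootstrap_servers=brokers,
    group_id=group_id,
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
collection = MongoClient("mongodb://database:27017/")["pipeline"]["messages"]

for record in consumer:
    collection.insert_one(record.value)   # one document per Kafka message
```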
To start Spark, run the following docker-compose commands:
- Start master/controller node
$ docker-compose up spark
- Start multiple instances of the worker node
$ docker-compose scale spark-worker=2
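To confirm that the workers registered with the master, a short PySpark session can be run from the host. This is a sketch under assumptions: pyspark is installed locally and docker-compose.yml publishes the master's port so it is reachable as spark://localhost:7077; adjust the master URL to your setup.

```python
# Minimal check that the Spark master and workers are reachable (illustrative sketch).
# Assumes: pip install pyspark, and that docker-compose.yml publishes the master's
# port so it is reachable as spark://localhost:7077 -- adjust to your setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("cloud-pipeline-smoke-test")
    .getOrCreate()
)

# A trivial job: if this returns, the master scheduled work on the workers.
print(spark.sparkContext.parallelize(range(100)).sum())
spark.stop()
```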
To start the stream-processor application, use the following command:
$ docker-compose up stream-processor
The database-processor can be started alongside it using:
$ docker-compose up --scale kafka=2 database-processor
Note: The database-processor and stream-processor applications belong to separate Kafka consumer groups. As such, running both of them provides simultaneous stream ingestion and processing capability.
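The fixed-window behaviour of the stream-processor can be approximated with plain kafka-python: poll for one window's worth of messages, compute an aggregate, and repeat. The sketch below only illustrates that pattern; the repository's stream-processor, its topic name, window length, and aggregate may all differ.

```python
# Illustration of fixed-window stream analytics over a Kafka topic.
# This sketches the pattern only; the repository's stream-processor, its topic
# name, window length, and aggregate function may all differ.
import time

from kafka import KafkaConsumer

WINDOW_SECONDS = 10                      # assumed window interval

consumer = KafkaConsumer(
    "sensor-data",                       # assumed topic name
    bootstrap_servers=["192.168.1.12:32812", "192.168.1.12:32814"],
    group_id="stream-processor-group",   # separate group from the database ingestor
)

while True:
    window_end = time.time() + WINDOW_SECONDS
    count = 0
    while time.time() < window_end:
        batch = consumer.poll(timeout_ms=1000)       # {TopicPartition: [records]}
        count += sum(len(records) for records in batch.values())
    print(f"messages in last {WINDOW_SECONDS}s window: {count}")
```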
The Python script to consume messages on any topic is present in the /Utils folder. Launch it as:
$ python3 client-report.py "kafka_broker" "topic_to_connect"
For example:
$ python3 client-report.py "192.168.1.12:32812,192.168.1.12:32814" report
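For reference, a consumer of this shape usually just subscribes to the given topic and prints every message it receives. The sketch below illustrates that idea and is not necessarily identical to the script in /Utils.

```python
# Sketch of a generic "print every message on a topic" consumer, in the spirit of
# client-report.py; the actual script in /Utils may differ in details.
import sys

from kafka import KafkaConsumer

brokers = sys.argv[1].split(",")     # e.g. "192.168.1.12:32812,192.168.1.12:32814"
topic = sys.argv[2]                  # e.g. "report"

consumer = KafkaConsumer(topic, bootstrap_servers=brokers, auto_offset_reset="earliest")
for record in consumer:
    print(f"{record.topic}:{record.partition}@{record.offset}: {record.value!r}")
```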
The recommended application for monitoring is Netdata.
- Figure 2: Sample application monitoring (Notice the containers at the bottom right)