Pre-Requisites:
- Get the source and build a package for the Spark Streaming job:

    cd /opt/
    git clone https://github.com/ashrithr/LogEventsProcessingSpark.git
    cd LogEventsProcessingSpark
    sbt package
- To emulate real-time log events, we will use an application called generator:

    cd /opt/
    git clone https://github.com/cloudwicklabs/generator.git
    cd generator
    sbt assembly
- Start a ZooKeeper server instance, which is required by Kafka:

    cd ${KAFKA_HOME}
    bin/zookeeper-server-start.sh config/zookeeper.properties

  Replace ${KAFKA_HOME} with the path where you installed Kafka.
- Start a Kafka broker instance:

    cd ${KAFKA_HOME}
    bin/kafka-server-start.sh config/server.properties

  Replace ${KAFKA_HOME} with the path where you installed Kafka.
- Create a Kafka topic with a retention period of 60 minutes (note that retention.ms is specified in milliseconds):

    cd ${KAFKA_HOME}
    bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic apache \
      --partitions 2 --replication-factor 1 \
      --config flush.messages=1 --config retention.ms=3600000

  See the Kafka documentation for more topic-level configuration options. Replace ${KAFKA_HOME} with the path where you installed Kafka, and adjust the replication factor and number of partitions to match the number of Kafka brokers you have. A minimal Scala producer for sanity-checking this topic is sketched just after this list.
- Start a Spark master instance:

    ${SPARK_HOME}/sbin/start-master.sh

  Replace ${SPARK_HOME} with the path where you installed Spark.
- Start a Spark worker locally:

    ${SPARK_HOME}/sbin/start-slave.sh Worker ${SPARK_URL}

  where SPARK_URL is the master URL, for example SPARK_URL="spark://Ashriths-MacBook-Pro.local:7077"; replace it with the URL of your own Spark master.
- Start a Cassandra instance in the foreground (skip this step if you already have Cassandra running):

    ${CASSANDRA_HOME}/bin/cassandra -f
- Start the generator, which simulates log events and writes them directly to Kafka:

    cd /opt/generator
    bin/generator log --eventsPerSec 1 --outputFormat text --destination kafka \
      --kafkaBrokerList "localhost:9092" --flushBatch 1 --kafkaTopicName apache
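As referenced in the topic-creation step above, you can sanity-check that the broker and the apache topic accept writes with a small Scala producer before relying on the generator. This is only an illustrative sketch, not part of the repository; it assumes the newer kafka-clients producer API is available on your classpath.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical smoke test: push a single Apache-style log line into the
// "apache" topic created earlier, to confirm the broker is reachable.
object TopicSmokeTest {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val fakeLogLine =
      """127.0.0.1 - - [01/Jan/2015:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024"""
    producer.send(new ProducerRecord[String, String]("apache", fakeLogLine))
    producer.close()
  }
}
```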
Finally, start the Spark Streaming application. Make sure to edit the parameters in run.sh to match your environment:

    SCALA_HOME="/opt/scala" ./run.sh
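For orientation, the core of the streaming job looks roughly like the sketch below: a receiver-based Kafka stream read through ZooKeeper under the sparkFetcher consumer group (the group queried by the offset checker in the next step). This is a simplified illustration, not the repository's actual code; the 5-second batch interval is assumed, and the per-batch count stands in for the real log parsing and Cassandra-writing logic.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Simplified sketch of the Kafka -> Spark Streaming wiring; the real job in
// this repository additionally parses each log line and writes to Cassandra.
object LogEventsProcessingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogEventsProcessingSketch")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batches (assumed)

    // Receiver-based stream: ZooKeeper quorum, consumer group, topic -> receiver threads
    val lines = KafkaUtils
      .createStream(ssc, "localhost:2181", "sparkFetcher", Map("apache" -> 2))
      .map { case (_, line) => line }

    // Stand-in for the real processing: count events in each batch
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```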
To verify that Spark is consuming events from Kafka, check the consumer offsets for the sparkFetcher group:

    cd ${KAFKA_HOME}
    bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
      --zkconnect localhost:2181 \
      --group sparkFetcher
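To verify the other end of the pipeline, you can also query Cassandra directly and confirm that processed events are landing there. The sketch below uses the DataStax Java driver from Scala; the keyspace and table names (logs.events) are placeholders, so substitute whatever keyspace and table the streaming job writes to in your setup.

```scala
import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

// Illustrative check against the local Cassandra node started earlier.
// "logs" and "events" are placeholder names, not the job's real schema.
object CassandraCheck {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()

    // Pull back a handful of rows to confirm data is arriving
    val rows = session.execute("SELECT * FROM logs.events LIMIT 10").asScala
    rows.foreach(row => println(row))

    cluster.close()
  }
}
```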