A little project to play with Spark Structured Streaming, Scala, Avro, Kafka and PostgreSQL.
To run the project, you need Spark, which can be downloaded here.
I also use ksql-datagen, a tool available in the Confluent Platform, which can be downloaded here. I use version 6.0.0 of the Confluent Platform (currently the latest).
Create some environment variables that point to the home folder of each tool: I have `CONFLUENT_HOME` to use ksql-datagen and `SPARK_HOME` to use Spark.
Clone the repository of this project.
You need to run the Kafka cluster: use `docker-compose -f docker-compose.yml up`.
- To run the project, run `sbt package`. It will generate a JAR which will be used by Spark. You can find it in the `target/scala-2.12` folder, with the name `spark-structured-streaming_2.12-0.1.jar`.
- Run `$SPARK_HOME/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 --class "JsonConsumer" --master local ./target/scala-2.12/spark-structured-streaming_2.12-0.1.jar`. The `JsonConsumer` corresponds to the name of the Scala consumer located in the src folder (a minimal sketch of such a consumer is shown after this list).
- After that, we will generate some data with the ksql-datagen tool:
$CONFLUENT_HOME/bin/ksql-datagen schema=schemas/transactions.avro format=json topic=transactions-json key=transactionid iterations=1000
- The `transactions.avro` schema lets ksql-datagen create data with the specified format.
- If you go back to the Spark window, you will see data appear.
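
For reference, here is a minimal sketch of what a consumer like `JsonConsumer` could look like. It is not the actual code of the repository: the bootstrap server, the schema fields and the console sink are assumptions to adapt to your setup and to the fields defined in `schemas/transactions.avro`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

object JsonConsumerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JsonConsumerSketch")
      .master("local")
      .getOrCreate()

    // Illustrative schema: adapt it to the fields of schemas/transactions.avro
    val schema = new StructType()
      .add("transactionid", StringType)
      .add("clientId", StringType)
      .add("amount", DoubleType)

    // Read the transactions-json topic and parse the JSON payload of each record
    val transactions = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // adapt to the docker-compose setup
      .option("subscribe", "transactions-json")
      .option("startingOffsets", "latest")
      .load()
      .select(from_json(col("value").cast("string"), schema).as("data"))
      .select("data.*")

    // Print the parsed events to the console
    transactions.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```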
To run the project, clone it and run `sbt assembly`. It will generate a JAR which will be used by Spark. You can find it in the `target/scala-2.12` folder, with the name `spark-structured-streaming.jar`.
- First, we need to generate some data with the ksql-datagen tool to create the schemas in the Registry. The Avro consumer and Spark need them to run correctly.
$CONFLUENT_HOME/bin/ksql-datagen schema=schemas/transactions.avro format=avro topic=transactions-avro key=transactionid iterations=1
$CONFLUENT_HOME/bin/ksql-datagen schema=schemas/clients.avro format=avro topic=clients-avro key=clientid iterations=1
- Now, we can launch the consumer:
$SPARK_HOME/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 --class "AvroConsumer" --master local ./target/scala-2.12/spark-structured-streaming.jar
The `AvroConsumer` corresponds to the name of the Scala consumer located in the src folder.
- Because of the `latest` configuration of the consumer, it will not re-read previously pushed events, so you have to re-launch the two ksql-datagen commands.
- The Avro consumer reads two different sources (or topics): clients and transactions. The two topics are linked by a `clientId` field. I configure a kind of join between the two sources, on the `clientId` field (a simplified sketch is shown after this list).
- If you go back to the Spark window, you will see data appear, but only when `transaction.clientId = client.clientId` :)
- If you look at the code, you can see a `dropDuplicates` function. We need this function to keep the streams from becoming too big. Without it, it can take a long time to match two linked entities. I chose to just drop the previous entry: if two clients have the same id, the old one is dropped. (It's not the best behaviour, but it's just a test; I will work on improving that.)
- Another way to do it is with a kind of `eventTime` field: you put as its value the moment when you push the event, and if the event is too old, it is removed from the stream. I think it's a better way to do what I want; I will work on that.
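
To illustrate the join and deduplication described above, here is a simplified sketch (not the actual `AvroConsumer` code). It assumes the `clients` and `transactions` streams have already been read from Kafka and deserialized into DataFrames with a `clientId` column, and, for the watermark variant, an `eventTime` column; the Avro decoding itself is left out.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

object JoinSketch {
  // Current approach: keep a single client row per clientId so the join state does not grow forever
  def joinWithDedup(clients: DataFrame, transactions: DataFrame): DataFrame = {
    val dedupClients = clients.dropDuplicates("clientId")
    // Inner join: a row only appears when transaction.clientId = client.clientId
    transactions.join(dedupClients, Seq("clientId"))
  }

  // Watermark variant: use an eventTime column so that too-old events are dropped from the state
  def joinWithWatermark(clients: DataFrame, transactions: DataFrame): DataFrame = {
    val c = clients.withWatermark("eventTime", "10 minutes").alias("c")
    val t = transactions.withWatermark("eventTime", "10 minutes").alias("t")
    t.join(
      c,
      expr("""
        t.clientId = c.clientId AND
        t.eventTime BETWEEN c.eventTime - interval 10 minutes
                        AND c.eventTime + interval 10 minutes
      """)
    )
  }
}
```

The watermark variant corresponds to the `eventTime` idea above: Spark can then discard state older than the watermark instead of keeping every client forever.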
Some scripts are available in the `scripts` folder. For example, to add transactions for the JSON consumer, you can run (from the `scripts` folder):
sh transactions.sh json
For the Avro consumer, just replace `json` with `avro`.
- First, we need the PostgreSQL driver (it can be found here). I use version 42.2.18 in this repository.
- I put the JAR at the root of the repository.
- Then, I use a tool called pgcli to create my table and to insert data. The command lines can be found in the scripts folder.
- Then, go to the scripts folder and launch `execute-postgresql-consumer.sh` (a hedged sketch of how the stream can be written to PostgreSQL is shown after this list).
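
For reference, here is a hedged sketch of how a streaming DataFrame can be written to PostgreSQL. It is not the repository's code: the JDBC URL, credentials and table name are placeholders to adapt to what the pgcli commands create. Structured Streaming has no built-in JDBC sink, so each micro-batch is written with the batch JDBC writer through `foreachBatch`.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

object PostgresSinkSketch {
  // `transactions` is assumed to be a streaming DataFrame already parsed from Kafka
  def writeToPostgres(transactions: DataFrame): Unit = {
    val jdbcUrl = "jdbc:postgresql://localhost:5432/postgres" // placeholder host/port/database
    val props = new Properties()
    props.setProperty("user", "postgres")     // placeholder credentials
    props.setProperty("password", "postgres")
    props.setProperty("driver", "org.postgresql.Driver")

    // Write each micro-batch with the classic batch JDBC writer
    val writeBatch = (batch: DataFrame, batchId: Long) =>
      batch.write
        .mode(SaveMode.Append)
        .jdbc(jdbcUrl, "transactions", props) // placeholder table name

    transactions.writeStream
      .foreachBatch(writeBatch)
      .start()
      .awaitTermination()
  }
}
```

The PostgreSQL driver JAR mentioned above must be on the Spark classpath when submitting.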
- First, you need 9 JARs. The list can be found in the file `execute-parquet-consumer.sh`; you can get them from the Maven repository. I put them in a `jars` folder. If you want to put them in another folder, don't forget to update the path in the script (the `jars/XXX` part of each jar).
- After that, you just have to run the script `execute-parquet-consumer.sh`.
- We can play with the data (in the Seq `dataHeroes` and/or `dataCities` in the `ParquetConsumer` Scala file) to add more data or update it. We can also play with the SQL requests.
- I had some compilation problems with the `toDF` and `toDS` functions. It seems the position of the import of the implicits matters; I can't understand it for the moment, but I keep searching :) (see the sketch after this list).
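
On the implicits point: `toDF` and `toDS` come from encoders that are members of a particular `SparkSession` instance, so `import spark.implicits._` can only be written after the `spark` value is defined, not at the top of the file with the other imports. Here is a minimal, illustrative sketch (the case class and data are placeholders, not the real content of `ParquetConsumer`):

```scala
import org.apache.spark.sql.SparkSession

object ParquetConsumerSketch {
  // Placeholder case class; the real fields live in the ParquetConsumer Scala file
  case class Hero(name: String, city: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParquetConsumerSketch")
      .master("local")
      .getOrCreate()

    // The import refers to members of this `spark` instance,
    // so it must come after the instance exists
    import spark.implicits._

    val dataHeroes = Seq(Hero("Batman", "Gotham"), Hero("Superman", "Metropolis"))

    val heroesDS = dataHeroes.toDS() // works thanks to spark.implicits._
    heroesDS.write.mode("overwrite").parquet("output/heroes") // illustrative path

    // Re-read the parquet files and query them with SQL
    spark.read.parquet("output/heroes").createOrReplaceTempView("heroes")
    spark.sql("SELECT name FROM heroes WHERE city = 'Gotham'").show()

    spark.stop()
  }
}
```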