tgsaman/Spark-Kafka-Databricks

Using Spark & Kafka in Databricks

About

These notes were put together to help my fellow Kubrick Consultants master the syntax of PySpark & Kafka, the Python packages we use to compute in Databricks.

Unfortunately, I was laid off before I had the opportunity to finish this repository as a company resource. Instead, I've opted to share a scrubbed version publicly, under the MIT license, to display my command of Spark and Kafka as tools.

Layout & Use:

This project is split into two sub-folders, one for each package we used with Databricks: Spark and Kafka.

  • Kafka Practice Streams
  • Spark Practice Exercises

Each folder demonstrates a set of techniques across a number of exercises (data ingestion for Spark and data streaming for Kafka).
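As a rough illustration of the streaming side, here is a minimal sketch that reads a Kafka topic through Spark Structured Streaming's Kafka source. This is an assumption about tooling, not necessarily the approach the exercises take. The broker address `localhost:9092`, the topic name `practice-topic`, and the helper names `kafka_source_options` and `read_kafka_stream` are all hypothetical.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="earliest"):
    """Build the option map for Spark's Kafka source (values are placeholders)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def read_kafka_stream(spark, options):
    """Return a streaming DataFrame with key and value decoded as strings.

    `spark` is an existing SparkSession, such as the one Databricks
    provides in every notebook.
    """
    raw = spark.readStream.format("kafka").options(**options).load()
    # The Kafka source delivers key and value as binary, so cast before use.
    return raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )


# Hypothetical broker and topic, for illustration only.
options = kafka_source_options("localhost:9092", "practice-topic")
print(options)
```

In a notebook with a reachable broker, `read_kafka_stream(spark, options)` would return a streaming DataFrame that still needs a `writeStream` sink before any data flows.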

Setup & Installation:

Run these notebooks in Databricks (preferred method)

To run the Spark dependencies, you'll need an underlying compute cluster. Clusters are a bit of a pain to get running, so I'm using Jupyter notebooks for now, which also let me keep detailed notes during the learning process. Fortunately, Databricks lets you upload Jupyter notebooks through the GUI, and lets you write and export code in the ipynb format. To test the code in the notebooks, you can start a 2-core Databricks cluster on a free trial and drop these notebooks into your environment with the GUI.
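Because an ipynb file is plain JSON, you can inspect a notebook before uploading it. The sketch below uses only the standard library and assumes the nbformat 4 layout (a top-level `cells` list whose entries carry `cell_type` and `source`). The sample notebook and the helper name `code_cells` are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def code_cells(notebook_path):
    """Return the source text of each code cell in an nbformat-4 notebook."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    cells = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # nbformat stores source as one string or a list of line strings.
            cells.append(source if isinstance(source, str) else "".join(source))
    return cells


# Hypothetical two-cell notebook standing in for a real exercise file.
sample = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": [
                "from pyspark.sql import SparkSession\n",
                "spark = SparkSession.builder.getOrCreate()",
            ],
        },
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.ipynb"
    path.write_text(json.dumps(sample), encoding="utf-8")
    found = code_cells(path)

# Flag notebooks that need a Spark runtime before you upload them.
uses_spark = any("pyspark" in src for src in found)
print(len(found), uses_spark)
```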

These exercises are adaptations of former client problems solved by senior staff at Kubrick Group. They shared their solutions with us in a classroom context, asking us to close gaps in efficiency, solve problems similar to the common requests they encountered, and generally get comfortable with building and using dataframes in a cluster compute context.
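To give a flavour of building and using a dataframe, here is a minimal sketch of a CSV ingestion followed by a grouped aggregate. It is not one of the original exercises: the file name `orders.csv`, the columns, the rows, and the helper names are all invented for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Made-up rows standing in for a client dataset.
ROWS = [("alice", "london", 3), ("bob", "leeds", 5), ("cara", "london", 2)]


def write_sample_csv(directory):
    """Write the sample rows to orders.csv and return the file path."""
    path = Path(directory) / "orders.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["customer", "city", "orders"])
        writer.writerows(ROWS)
    return path


def ingest_and_summarise(spark, csv_path):
    """Read the CSV into a DataFrame and total the orders per city."""
    df = spark.read.csv(str(csv_path), header=True, inferSchema=True)
    return df.groupBy("city").sum("orders")


def main():
    # Imported here so the helpers above load even without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    with tempfile.TemporaryDirectory() as tmp:
        ingest_and_summarise(spark, write_sample_csv(tmp)).show()
```

Call `main()` in an environment where pyspark is installed. On Databricks the notebook already provides `spark`, and the path would more typically point at DBFS or cloud storage than at a local temporary directory.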

Documentation

Apache Spark Documentation

Apache's website hosts the Spark documentation here: https://spark.apache.org/docs/latest/

Installation and update notes can be found here:

https://spark.apache.org/docs/latest/api/python/getting_started/install.html

Kafka: The Definitive Guide

Download the Kafka ebook from Confluent.io (a cloud-hosted Kafka application) https://www.confluent.io/resources/kafka-the-definitive-guide/

Databricks Academy https://partner-academy.databricks.com/learn
