# PySpark

Much of this guide is built from the following sources.

## Starting a PySpark session

### Import the correct methods

There are older Spark instantiation classes under the Context names: SparkContext, SQLContext, and HiveContext. We don't need those now that SparkSession is available, introduced in Spark 2.0. We will use SparkConf to configure a few settings of our Spark environment. Finally, the Spark SQL functions are a must for running optimized Spark code; I have elected to import them under the abbreviation F.

```python
# Databricks handles the first two imports automatically.
from pyspark.sql import SparkSession
from pyspark import SparkConf

# You will still need to execute this import on Databricks.
# Reference: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
from pyspark.sql import functions as F
```
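
With the imports in place you can build the session. Below is a minimal sketch, assuming a local (non-Databricks) environment; the app name and memory value are placeholders, not recommendations. On Databricks the `spark` session already exists, so this step is unnecessary there.

```python
# Minimal sketch: configure and start a SparkSession (placeholder settings).
conf = SparkConf().setAppName("example_app").set("spark.driver.memory", "4g")

spark = (
    SparkSession.builder
    .config(conf=conf)
    .getOrCreate()
)
```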

See configuration_docker.md for additional settings that can help during a Spark session.

## Methods

### .pivot()

The shape of our data is important for machine learning and visualization. In many situations you will move back and forth between 'wide' and 'long' formats as you build data for your project. PySpark provides the .pivot() method on grouped data for reshaping long data into a wide format, as sketched below.
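
As a sketch of the long-to-wide direction, .pivot() is called on a grouped DataFrame and followed by an aggregation. The id/measure/value column names here are made up for illustration:

```python
# Hypothetical long-format data: one row per (id, measure) pair.
long_df = spark.createDataFrame(
    [("a", "height", 1.2), ("a", "weight", 3.4), ("b", "height", 5.6)],
    ["id", "measure", "value"],
)

# Reshape to wide: one row per id, one column per distinct measure value.
wide_df = long_df.groupBy("id").pivot("measure").agg(F.first("value"))
wide_df.show()
# +---+------+------+
# | id|height|weight|
# +---+------+------+
# |  a|   1.2|   3.4|
# |  b|   5.6|  null|
# +---+------+------+   (row order may vary)
```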

### Collecting a list of values using ArrayType

Spark DataFrames allow us to store arrays in a cell. These columns are called ArrayType columns.
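
One common way to produce an ArrayType column is the collect_list aggregate function. A minimal sketch with made-up store/item data:

```python
# Hypothetical data: one row per item sold at a store.
sales_df = spark.createDataFrame(
    [("store_1", "apples"), ("store_1", "pears"), ("store_2", "apples")],
    ["store", "item"],
)

# collect_list gathers the item values for each store into a single array cell.
items_df = sales_df.groupBy("store").agg(F.collect_list("item").alias("items"))
# `items` is now an ArrayType(StringType()) column: one list of item names per store.
```

F.collect_set is the counterpart that drops duplicate values while collecting.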

## Aggregated Calculations

See [aggregate_calculations.md](aggregate_calculations.md).