Skip to content

Spark Performance Benchmark suite to evaluate all TPC-DS and TPC-H query times

Notifications You must be signed in to change notification settings

satyakommula96/spark_benchmark

Repository files navigation

TPC-H and TPC-DS performance benchmark for Spark with DeltaLake

Based on: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

spark_benchmark

Table of Contents

  1. Clone submodules
  2. Prerequisites
  3. Build
  4. Performance Benchmarking
  5. Reports

1. Clone submodules

git submodule init
git submodule update

2. Prerequisites

2.1 Install Dependencies

sudo apt-get install unzip zip gcc make flex bison byacc git build-essential -y
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 8.0.302-open
sdk install sbt 0.13.18

2.2 Download spark

wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
mkdir spark3
tar -xzf spark-3.2.0-bin-hadoop3.2.tgz --strip 1 -C spark3

NOTE: Deploy spark in standalone cluster mode.

2.3 Download sbt launcher

wget -P /tmp https://github.com/sbt/sbt/releases/download/v0.13.18/sbt-0.13.18.tgz
tar -xf /tmp/sbt-0.13.18.tgz  -C /tmp

3. Build

3.1 spark-sql-perf

cd spark-sql-perf
cp /tmp/sbt/bin/sbt-launch.jar build/sbt-launch-0.13.18.jar
bin/run
sbt +package

3.2 tpcds-kit

cd ../tpcds-kit/tools
make CC=gcc-9 OS=LINUX

3.3 tpch-dbgen

cd ../../tpch-dbgen
git checkout 0469309147b42abac8857fa61b4cf69a6d3128a8 -- bm_utils.c
make

NOTE: This should be installed on all cluster nodes with the same location and build tpcds-kit, tpch-dbgen

4. Performance Benchmarking

NOTE: Change master-ip executor-memory , num-executors,executor-cores a/c to your machine specifications in .sh files.

4.1 TPCH

4.1.1 Parquet

cd ../tpch
#For generating ~100GB parquet data
./gendata_parquet.sh
# For runing all 22 TPC-H Queries
./runtpch_parquet.sh

4.1.2 ORC

cd ../tpch
#For generating ~100GB orc data
./gendata_orc.sh
# For runing all 22 TPC-H Queries
./runtpch_orc.sh

4.1.3 CSV

cd ../tpch
#For generating ~100GB csv data
./gendata_csv.sh
# For runing all 22 TPC-H Queries
./runtpch_csv.sh

4.1.4 JSON

cd ../tpch
#For generating ~100GB csv data
./gendata_json.sh
# For runing all 22 TPC-H Queries
./runtpch_son.sh

4.2 TPCDS

4.2.1 Parquet

cd ../tpcds
#For generating ~100GB parquet data
./gendata_parquet.sh
# For runing all 99 TPC-DS Queries
./runtpch_parquet.sh

4.2.2 ORC

cd ../tpcds
#For generating ~100GB orc data
./gendata_orc.sh
# For runing all 99 TPC-DS Queries
./runtpch_orc.sh

4.2.3 CSV

cd ../tpcds
#For generating ~100GB csv data
./gendata_csv.sh
# For runing all 99 TPC-DS Queries
./runtpch_csv.sh

4.2.4 JSON

cd ../tpcds
#For generating ~100GB csv data
./gendata_json.sh
# For runing all 99 TPC-DS Queries
./runtpch_json.sh

## 5. Reports

### 5.1 TPCH reports

```bash
cd tpch/tpch_<parquet,orc,csv>_reports
# result will be present in part*.csv file

5.2 TPCDS reports

cd tpcds/tpcds_<parquet,orc,csv,json>_reports
# result will be present in part*.csv file

About

Spark Performance Benchmark suite to evaluate all TPC-DS and TPC-H query times

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published