Merge branch 'main' into delta
melodyyangaws authored Oct 10, 2023
2 parents 6643344 + ed4c329 commit 8ca5f2b
Showing 8 changed files with 134 additions and 72 deletions.
37 changes: 23 additions & 14 deletions README.md
@@ -1,32 +1,30 @@
## Spark on Kubernetes benchmark utility

-This repository is used to benchmark Spark performance on Kubernetes.
+This repository provides a general tool to benchmark Spark performance.
If you want to use the [prebuilt docker image](https://github.com/aws-samples/emr-on-eks-benchmark/pkgs/container/emr-on-eks-benchmark) based on a prebuilt OSS spark_3.1.2_hadoop_3.3.1, you can skip the [build section](#Build-benchmark-utility-docker-image) and jump to [Run Benchmark](#Run-Benchmark) directly. If you want to build your own, follow the steps in the [build section](#Build-benchmark-utility-docker-image).

## Prerequisite

-- eksctl is installed
+- eksctl is installed (>= 0.143.0)
```bash
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv -v /tmp/eksctl /usr/local/bin
eksctl version
```
-- Update AWS CLI to the latest (requires aws cli version >= 2.1.14) on macOS. Check out the [link](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) for Linux or Windows
+- Update AWS CLI to the latest (requires aws cli version >= 2.11.23) on macOS. Check out the [link](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) for Linux or Windows
```bash
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg ./AWSCLIV2.pkg -target /
aws --version
rm AWSCLIV2.pkg
```
-- Install kubectl on macOS, check out the [link](https://kubernetes.io/docs/tasks/tools/) for Linux or Windows.
+- Install kubectl (>= 1.26.4) on macOS, check out the [link](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/) for Linux or Windows.
```bash
-curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl"
-chmod +x ./kubectl
-sudo mv ./kubectl /usr/local/bin/kubectl && export PATH=/usr/local/bin:$PATH
-sudo chown root: /usr/local/bin/kubectl
+curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
+sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --short --client
```
-- Helm CLI
+- Helm CLI (>= 3.2.1)
```bash
curl -sSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
helm version --short
@@ -64,7 +62,7 @@ aws ecr create-repository --repository-name spark --image-scanning-configuration
docker build -t $ECR_URL/spark:3.1.2_hadoop_3.3.1 -f docker/hadoop-aws-3.3.1/Dockerfile --build-arg HADOOP_VERSION=3.3.1 --build-arg SPARK_VERSION=3.1.2 .
docker push $ECR_URL/spark:3.1.2_hadoop_3.3.1

# Build benchmark utility based on the Spark
docker build -t $ECR_URL/eks-spark-benchmark:3.1.2 -f docker/benchmark-util/Dockerfile --build-arg SPARK_BASE_IMAGE=$ECR_URL/spark:3.1.2_hadoop_3.3.1 .
```
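For orientation, the image tags in the build commands above are composed from the registry URL and the Spark/Hadoop versions; a minimal sketch with a hypothetical account ID and region:

```shell
# Hypothetical registry URL and versions, mirroring the build commands above
ECR_URL="123456789012.dkr.ecr.us-east-1.amazonaws.com"
SPARK_VERSION="3.1.2"
HADOOP_VERSION="3.3.1"

# The Spark base image tag, and the benchmark image built on top of it
SPARK_IMAGE="${ECR_URL}/spark:${SPARK_VERSION}_hadoop_${HADOOP_VERSION}"
BENCH_IMAGE="${ECR_URL}/eks-spark-benchmark:${SPARK_VERSION}"
echo "$SPARK_IMAGE"
echo "$BENCH_IMAGE"
```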

@@ -116,20 +114,31 @@ bash examples/emr6.5-benchmark.sh
```
### Benchmark for EMR on EC2
A few notes for the setup:
1. Use the same instance type c5d.9xlarge as in the EKS cluster.
2. If choosing an EBS-backed instance, check the [default instance storage setting](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html) by EMR on EC2, and attach the same number of EBS volumes to your EKS cluster before running EKS related benchmarks.

-The benchmark utility app was compiled to a jar file during an [automated GitHub workflow](https://github.com/aws-samples/emr-on-eks-benchmark/actions/workflows/relase-package.yaml) process. The quickest way to get the jar is from a running Kubernetes container.
+The benchmark utility app was compiled to a jar file during an [automated GitHub workflow](https://github.com/aws-samples/emr-on-eks-benchmark/actions/workflows/relase-package.yaml) process. If you already have a running Kubernetes container, the quickest way to get the jar is to use the `kubectl cp` command as shown below:
```bash
# Download the jar and ignore the warning message
kubectl cp oss/oss-spark-tpcds-exec-1:/opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar eks-spark-benchmark-assembly-1.0.jar
```

However, if you are running a benchmark just for EMR on EC2, you probably don't have a running container. To copy the jar file from a docker container, you need two terminals. In the first terminal, spin up a docker container based on the image you built:
```bash
docker run --name spark-benchmark -it $ECR_URL/eks-spark-benchmark:3.1.2 bash
# you are logged in to the container now, find the jar file
hadoop@9ca5b2afe778: ls -alh /opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar
```
Keep the container running, then in the second terminal run the command to copy the jar file from the container to your local directory:
```bash
docker cp spark-benchmark:/opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar .

# Upload to s3
S3BUCKET=<S3_BUCKET_HAS_TPCDS_DATASET>
aws s3 cp eks-spark-benchmark-assembly-1.0.jar s3://$S3BUCKET
```
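As a quick sanity check before the EMR Step, the upload destination is simply the bucket plus the jar name; a trivial sketch with a hypothetical bucket name:

```shell
# Hypothetical bucket name; substitute the bucket that holds your TPC-DS dataset
S3BUCKET="my-tpcds-bucket"
JAR="eks-spark-benchmark-assembly-1.0.jar"
DEST="s3://${S3BUCKET}/${JAR}"
echo "$DEST"
```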

Submit the benchmark job via EMR Step on the AWS console. Make sure the EMR on EC2 cluster can access the `$S3BUCKET`:
```bash
# Step type: Spark Application
# JAR location: s3://$S3BUCKET/eks-spark-benchmark-assembly-1.0.jar
21 changes: 21 additions & 0 deletions docker/benchmark-util/.gitignore
@@ -0,0 +1,21 @@
*.DS_Store
*.class
*.log
*.pyc
sbt/*.jar
.idea
.idea_modules
*.iml

# sbt specific
build/*.jar
.cache/
.history/
.lib/
dist/*
target/
lib_managed/
src_managed/
project/boot/
project/plugins/project/
performance/
4 changes: 2 additions & 2 deletions docker/benchmark-util/Dockerfile
@@ -13,6 +13,7 @@ RUN yum update -y && \
make OS=LINUX



FROM mozilla/sbt:8u292_1.5.4 as sbtenv

# Build the Databricks SQL perf library from the local Spark version
@@ -35,5 +36,4 @@ COPY --from=tpc-toolkit /tmp/tpcds-kit/tools /opt/tpcds-kit/tools
COPY --from=sbtenv /tmp/emr-on-eks-benchmark/benchmark/target/scala-2.12/*jar ${SPARK_HOME}/examples/jars/

# # Use hadoop user and group
USER hadoop:hadoop
12 changes: 12 additions & 0 deletions docker/emr-jdk11/Dockerfile
@@ -0,0 +1,12 @@
FROM 021732063925.dkr.ecr.us-west-2.amazonaws.com/eks-spark-benchmark:emr6.10_jdk8
USER root
ENV JAVA_HOME /etc/alternatives/jre

RUN rpm -qa | grep corretto | xargs yum -y remove \
# to keep hadoop-lzo dependency
&& rpm -e --nodeps java-1.8.0-openjdk-headless \
&& amazon-linux-extras install java-openjdk11 \
&& yum clean all
RUN alternatives --set java /usr/lib/jvm/$(ls /usr/lib/jvm | grep java-11 | cut -f 3)/bin/java
# # Use hadoop user and group
USER hadoop:hadoop
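The `alternatives --set` line above resolves the JDK 11 directory with an `ls | grep | cut` pipeline. A small sketch of that filter, using hypothetical `/usr/lib/jvm` entries: `cut -f 3` splits on tabs, so a line containing no tabs passes through whole, and only the `java-11` entry survives the grep.

```shell
# Hypothetical /usr/lib/jvm listing; only the java-11 entry survives the filter,
# and `cut -f 3` leaves a tab-free line unchanged
printf 'java-1.8.0-openjdk\njava-11-openjdk-11.0.19.0.7-1.amzn2.x86_64\njre\n' \
  | grep java-11 | cut -f 3
```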
19 changes: 19 additions & 0 deletions docker/emr-jdk11/Dockerfile_corretto
@@ -0,0 +1,19 @@
FROM 021732063925.dkr.ecr.us-west-2.amazonaws.com/eks-spark-benchmark:emr6.10_jdk8
USER root

# RUN amazon-linux-extras enable nginx1 \
# && rpm --import https://yum.corretto.aws/corretto.key \
# && curl -L -o /etc/yum.repos.d/corretto.repo https://yum.corretto.aws/corretto.repo
RUN yum update -y \
&& amazon-linux-extras disable corretto8 \
# && rpm -qa | grep -E "openjdk|corretto" | xargs yum -y remove \
&& rpm -qa | grep corretto | xargs yum -y remove \
# to keep hadoop-lzo dependency
&& rpm -e --nodeps java-1.8.0-openjdk-headless \
&& yum install -y java-11-amazon-corretto \
&& yum clean all


# RUN alternatives --set java /usr/lib/jvm/$(ls /usr/lib/jvm | grep corretto | cut -f 3)/bin/java
# # Use hadoop user and group
USER hadoop:hadoop
53 changes: 53 additions & 0 deletions examples/emr6.10-benchmark_c5.sh
@@ -0,0 +1,53 @@
#!/bin/bash
# SPDX-FileCopyrightText: Copyright 2021 Amazon.com, Inc. or its affiliates.
# SPDX-License-Identifier: MIT-0

# cross account test
# "spark.hadoop.fs.s3.bucket.emr-eks-demo-720560070661-us-east-1.customAWSCredentialsProvider": "com.amazonaws.emr.AssumeRoleAWSCredentialsProvider",
# "spark.kubernetes.driverEnv.ASSUME_ROLE_CREDENTIALS_ROLE_ARN": "arn:aws:iam::720560070661:role/EMRContainers-JobExecutionRole",
# "spark.executorEnv.ASSUME_ROLE_CREDENTIALS_ROLE_ARN": "arn:aws:iam::720560070661:role/EMRContainers-JobExecutionRole"

# export EMRCLUSTER_NAME=emr-on-eks-rss
# export AWS_REGION=us-east-1
export ACCOUNTID=$(aws sts get-caller-identity --query Account --output text)
export VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --query "virtualClusters[?name == '$EMRCLUSTER_NAME' && state == 'RUNNING'].id" --output text)
export EMR_ROLE_ARN=arn:aws:iam::$ACCOUNTID:role/$EMRCLUSTER_NAME-execution-role
export S3BUCKET=$EMRCLUSTER_NAME-$ACCOUNTID-$AWS_REGION
export ECR_URL="$ACCOUNTID.dkr.ecr.$AWS_REGION.amazonaws.com"

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name emr610-JDK8 \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.9.0-latest \
--retry-policy-configuration '{"maxAttempts": 5}' \
--job-driver '{
"sparkSubmitJobDriver": {
"entryPoint": "local:///usr/lib/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar",
"entryPointArguments":["s3://'$S3BUCKET'/BLOG_TPCDS-TEST-3T-partitioned","s3://'$S3BUCKET'/JDK_EMRONEKS_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","1","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"],
"sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --conf spark.driver.cores=4 --conf spark.driver.memory=5g --conf spark.executor.cores=4 --conf spark.executor.memory=6g --conf spark.executor.instances=47"}}' \
--configuration-overrides '{
"applicationConfiguration": [
{
"classification": "spark-defaults",
"properties": {
"spark.kubernetes.container.image": "'$ECR_URL'/eks-spark-benchmark:emr6.10_jdk8",
"spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/app_code/pod-template/driver-pod-template.yaml",
"spark.kubernetes.executor.podTemplateFile": "s3://'$S3BUCKET'/app_code/pod-template/executor-pod-template.yaml",
"spark.kubernetes.driver.limit.cores": "4.1",
"spark.kubernetes.executor.limit.cores": "4.3",
"spark.driver.memoryOverhead": "1000",
"spark.executor.memoryOverhead": "2G",
"spark.network.timeout": "2000s",
"spark.executor.heartbeatInterval": "300s",
"spark.kubernetes.node.selector.eks.amazonaws.com/nodegroup": "c59d"
}},
{
"classification": "spark-log4j",
"properties": {
"rootLogger.level" : "WARN"
}
}
],
"monitoringConfiguration": {
"s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
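Note that the script relies on `EMRCLUSTER_NAME` and `AWS_REGION` being exported before it runs (the exports near the top are commented out). A sketch of how the derived names fall out, with a hypothetical account ID:

```shell
# Hypothetical inputs -- the script derives everything else from these
EMRCLUSTER_NAME="emr-on-eks-rss"
ACCOUNTID="123456789012"
AWS_REGION="us-east-1"

# Derived the same way the script does
EMR_ROLE_ARN="arn:aws:iam::${ACCOUNTID}:role/${EMRCLUSTER_NAME}-execution-role"
S3BUCKET="${EMRCLUSTER_NAME}-${ACCOUNTID}-${AWS_REGION}"
echo "$EMR_ROLE_ARN"
echo "$S3BUCKET"
```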
53 changes: 0 additions & 53 deletions examples/emr6.6-benchmark_c5.sh

This file was deleted.

7 changes: 4 additions & 3 deletions provision.sh
@@ -9,7 +9,7 @@
export OSS_SPARK_SVCACCT_NAME=oss
export OSS_NAMESPACE=oss
export EMR_NAMESPACE=emr
-export EKS_VERSION=1.21
+export EKS_VERSION=1.26
export EMRCLUSTER_NAME=emr-on-$EKSCLUSTER_NAME
export ROLE_NAME=${EMRCLUSTER_NAME}-execution-role
export ACCOUNTID=$(aws sts get-caller-identity --query Account --output text)
@@ -193,7 +193,7 @@ autoDiscovery:
clusterName: $EKSCLUSTER_NAME
awsRegion: $AWS_REGION
image:
-tag: v1.21.1
+tag: v1.26.3
nodeSelector:
app: sparktest
podAnnotations:
@@ -213,7 +213,8 @@ helm install nodescaler autoscaler/cluster-autoscaler --namespace kube-system --

# Install Spark-Operator for the OSS Spark test
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
-helm install -n $OSS_NAMESPACE spark-operator spark-operator/spark-operator --version 1.1.6 \
+helm repo update
+helm install -n $OSS_NAMESPACE spark-operator spark-operator/spark-operator --version 1.1.27 \
--set serviceAccounts.spark.create=false --set metrics.enable=false --set webhook.enable=true --set webhook.port=443 --debug

echo "============================================================================="
