Merge branch 'main' into delta
melodyyangaws authored Oct 10, 2023
2 parents 6643344 + ed4c329 commit 8ca5f2b
Showing 8 changed files with 134 additions and 72 deletions.
37 changes: 23 additions & 14 deletions README.md
@@ -1,32 +1,30 @@
## Spark on Kubernetes benchmark utility

-This repository is used to benchmark Spark performance on Kubernetes.
+This repository provides a general tool to benchmark Spark performance.
If you want to use the [prebuilt docker image](https://github.com/aws-samples/emr-on-eks-benchmark/pkgs/container/emr-on-eks-benchmark) based on a prebuilt OSS spark_3.1.2_hadoop_3.3.1, you can skip the [build section](#Build-benchmark-utility-docker-image) and jump to [Run Benchmark](#Run-Benchmark) directly. If you want to build your own, follow the steps in the [build section](#Build-benchmark-utility-docker-image).

## Prerequisite

-- eksctl is installed
+- eksctl is installed (>= 0.143.0)
```bash
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv -v /tmp/eksctl /usr/local/bin
eksctl version
```
-- Update AWS CLI to the latest (requires aws cli version >= 2.1.14) on macOS. Check out the [link](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) for Linux or Windows
+- Update AWS CLI to the latest (requires aws cli version >= 2.11.23) on macOS. Check out the [link](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) for Linux or Windows
```bash
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg ./AWSCLIV2.pkg -target /
aws --version
rm AWSCLIV2.pkg
```
-- Install kubectl on macOS, check out the [link](https://kubernetes.io/docs/tasks/tools/) for Linux or Windows.
+- Install kubectl (>= 1.26.4) on macOS, check out the [link](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/) for Linux or Windows.
```bash
-curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl"
-chmod +x ./kubectl
-sudo mv ./kubectl /usr/local/bin/kubectl && export PATH=/usr/local/bin:$PATH
-sudo chown root: /usr/local/bin/kubectl
+curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
+sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --short --client
```
-- Helm CLI
+- Helm CLI (>= 3.2.1)
```bash
curl -sSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
helm version --short
@@ -64,7 +62,7 @@ aws ecr create-repository --repository-name spark --image-scanning-configuration
docker build -t $ECR_URL/spark:3.1.2_hadoop_3.3.1 -f docker/hadoop-aws-3.3.1/Dockerfile --build-arg HADOOP_VERSION=3.3.1 --build-arg SPARK_VERSION=3.1.2 .
docker push $ECR_URL/spark:3.1.2_hadoop_3.3.1

# Build benchmark utility based on the Spark
docker build -t $ECR_URL/eks-spark-benchmark:3.1.2 -f docker/benchmark-util/Dockerfile --build-arg SPARK_BASE_IMAGE=$ECR_URL/spark:3.1.2_hadoop_3.3.1 .
```
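For orientation, the image tags in the build commands above are composed from the registry URL and the Spark/Hadoop versions; a minimal sketch with a hypothetical account ID and region:

```shell
# Hypothetical registry URL and versions, mirroring the build commands above
ECR_URL="123456789012.dkr.ecr.us-east-1.amazonaws.com"
SPARK_VERSION="3.1.2"
HADOOP_VERSION="3.3.1"

# The Spark base image tag, and the benchmark image built on top of it
SPARK_IMAGE="${ECR_URL}/spark:${SPARK_VERSION}_hadoop_${HADOOP_VERSION}"
BENCH_IMAGE="${ECR_URL}/eks-spark-benchmark:${SPARK_VERSION}"
echo "$SPARK_IMAGE"
echo "$BENCH_IMAGE"
```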

@@ -116,20 +114,31 @@ bash examples/emr6.5-benchmark.sh
```
### Benchmark for EMR on EC2
A few notes for the setup:
1. Use the same instance type c5d.9xlarge as in the EKS cluster.
2. If choosing an EBS-backed instance, check the [default instance storage setting](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html) by EMR on EC2, and attach the same number of EBS volumes to your EKS cluster before running EKS related benchmarks.

-The benchmark utility app was compiled to a jar file during an [automated GitHub workflow](https://github.com/aws-samples/emr-on-eks-benchmark/actions/workflows/relase-package.yaml) process. The quickest way to get the jar is from a running Kubernetes container.
+The benchmark utility app was compiled to a jar file during an [automated GitHub workflow](https://github.com/aws-samples/emr-on-eks-benchmark/actions/workflows/relase-package.yaml) process. If you already have a running Kubernetes container, the quickest way to get the jar is to use the `kubectl cp` command as shown below:
```bash
# Download the jar and ignore the warning message
kubectl cp oss/oss-spark-tpcds-exec-1:/opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar eks-spark-benchmark-assembly-1.0.jar
```

However, if you are running a benchmark just for EMR on EC2, you probably don't have a running container. To copy the jar file from a docker container, you need two terminals. In the first terminal, spin up a docker container based on the image you built:
```bash
docker run --name spark-benchmark -it $ECR_URL/eks-spark-benchmark:3.1.2 bash
# you are logged in to the container now, find the jar file
hadoop@9ca5b2afe778: ls -alh /opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar
```
Keep the container running, then in the second terminal run the command to copy the jar file from the container to your local directory:
```bash
docker cp spark-benchmark:/opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar .

# Upload to s3
S3BUCKET=<S3_BUCKET_HAS_TPCDS_DATASET>
aws s3 cp eks-spark-benchmark-assembly-1.0.jar s3://$S3BUCKET
```
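As a quick sanity check before the EMR Step, the upload destination is simply the bucket plus the jar name; a trivial sketch with a hypothetical bucket name:

```shell
# Hypothetical bucket name; substitute the bucket that holds your TPC-DS dataset
S3BUCKET="my-tpcds-bucket"
JAR="eks-spark-benchmark-assembly-1.0.jar"
DEST="s3://${S3BUCKET}/${JAR}"
echo "$DEST"
```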

Submit the benchmark job via EMR Step on the AWS console. Make sure the EMR on EC2 cluster can access the `$S3BUCKET`:
```bash
# Step type: Spark Application
# JAR location: s3://$S3BUCKET/eks-spark-benchmark-assembly-1.0.jar
21 changes: 21 additions & 0 deletions docker/benchmark-util/.gitignore
@@ -0,0 +1,21 @@
*.DS_Store
*.class
*.log
*.pyc
sbt/*.jar
.idea
.idea_modules
*.iml

# sbt specific
build/*.jar
.cache/
.history/
.lib/
dist/*
target/
lib_managed/
src_managed/
project/boot/
project/plugins/project/
performance/
4 changes: 2 additions & 2 deletions docker/benchmark-util/Dockerfile
@@ -13,6 +13,7 @@ RUN yum update -y && \
make OS=LINUX



FROM mozilla/sbt:8u292_1.5.4 as sbtenv

# Build the Databricks SQL perf library from the local Spark version
@@ -35,5 +36,4 @@ COPY --from=tpc-toolkit /tmp/tpcds-kit/tools /opt/tpcds-kit/tools
COPY --from=sbtenv /tmp/emr-on-eks-benchmark/benchmark/target/scala-2.12/*jar ${SPARK_HOME}/examples/jars/

# # Use hadoop user and group
USER hadoop:hadoop
12 changes: 12 additions & 0 deletions docker/emr-jdk11/Dockerfile
@@ -0,0 +1,12 @@
FROM 021732063925.dkr.ecr.us-west-2.amazonaws.com/eks-spark-benchmark:emr6.10_jdk8
USER root
ENV JAVA_HOME /etc/alternatives/jre

RUN rpm -qa | grep corretto | xargs yum -y remove \
# to keep hadoop-lzo dependency
&& rpm -e --nodeps java-1.8.0-openjdk-headless \
&& amazon-linux-extras install java-openjdk11 \
&& yum clean all
RUN alternatives --set java /usr/lib/jvm/$(ls /usr/lib/jvm | grep java-11 | cut -f 3)/bin/java
# # Use hadoop user and group
USER hadoop:hadoop
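The `alternatives --set` line above resolves the JDK 11 directory with an `ls | grep | cut` pipeline. A small sketch of that filter, using hypothetical `/usr/lib/jvm` entries: `cut -f 3` splits on tabs, so a line containing no tabs passes through whole, and only the `java-11` entry survives the grep.

```shell
# Hypothetical /usr/lib/jvm listing; only the java-11 entry survives the filter,
# and `cut -f 3` leaves a tab-free line unchanged
printf 'java-1.8.0-openjdk\njava-11-openjdk-11.0.19.0.7-1.amzn2.x86_64\njre\n' \
  | grep java-11 | cut -f 3
```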
19 changes: 19 additions & 0 deletions docker/emr-jdk11/Dockerfile_corretto
@@ -0,0 +1,19 @@
FROM 021732063925.dkr.ecr.us-west-2.amazonaws.com/eks-spark-benchmark:emr6.10_jdk8
USER root

# RUN amazon-linux-extras enable nginx1 \
# && rpm --import https://yum.corretto.aws/corretto.key \
# && curl -L -o /etc/yum.repos.d/corretto.repo https://yum.corretto.aws/corretto.repo
RUN yum update -y \
&& amazon-linux-extras disable corretto8 \
# && rpm -qa | grep -E "openjdk|corretto" | xargs yum -y remove \
&& rpm -qa | grep corretto | xargs yum -y remove \
# to keep hadoop-lzo dependency
&& rpm -e --nodeps java-1.8.0-openjdk-headless \
&& yum install -y java-11-amazon-corretto \
&& yum clean all


# RUN alternatives --set java /usr/lib/jvm/$(ls /usr/lib/jvm | grep corretto | cut -f 3)/bin/java
# # Use hadoop user and group
USER hadoop:hadoop
53 changes: 53 additions & 0 deletions examples/emr6.10-benchmark_c5.sh
@@ -0,0 +1,53 @@
#!/bin/bash
# SPDX-FileCopyrightText: Copyright 2021 Amazon.com, Inc. or its affiliates.
# SPDX-License-Identifier: MIT-0

# cross account test
# "spark.hadoop.fs.s3.bucket.emr-eks-demo-720560070661-us-east-1.customAWSCredentialsProvider": "com.amazonaws.emr.AssumeRoleAWSCredentialsProvider",
# "spark.kubernetes.driverEnv.ASSUME_ROLE_CREDENTIALS_ROLE_ARN": "arn:aws:iam::720560070661:role/EMRContainers-JobExecutionRole",
# "spark.executorEnv.ASSUME_ROLE_CREDENTIALS_ROLE_ARN": "arn:aws:iam::720560070661:role/EMRContainers-JobExecutionRole"

# export EMRCLUSTER_NAME=emr-on-eks-rss
# export AWS_REGION=us-east-1
export ACCOUNTID=$(aws sts get-caller-identity --query Account --output text)
export VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --query "virtualClusters[?name == '$EMRCLUSTER_NAME' && state == 'RUNNING'].id" --output text)
export EMR_ROLE_ARN=arn:aws:iam::$ACCOUNTID:role/$EMRCLUSTER_NAME-execution-role
export S3BUCKET=$EMRCLUSTER_NAME-$ACCOUNTID-$AWS_REGION
export ECR_URL="$ACCOUNTID.dkr.ecr.$AWS_REGION.amazonaws.com"

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name emr610-JDK8 \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.9.0-latest \
--retry-policy-configuration '{"maxAttempts": 5}' \
--job-driver '{
"sparkSubmitJobDriver": {
"entryPoint": "local:///usr/lib/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar",
"entryPointArguments":["s3://'$S3BUCKET'/BLOG_TPCDS-TEST-3T-partitioned","s3://'$S3BUCKET'/JDK_EMRONEKS_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","1","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"],
"sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --conf spark.driver.cores=4 --conf spark.driver.memory=5g --conf spark.executor.cores=4 --conf spark.executor.memory=6g --conf spark.executor.instances=47"}}' \
--configuration-overrides '{
"applicationConfiguration": [
{
"classification": "spark-defaults",
"properties": {
"spark.kubernetes.container.image": "'$ECR_URL'/eks-spark-benchmark:emr6.10_jdk8",
"spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/app_code/pod-template/driver-pod-template.yaml",
"spark.kubernetes.executor.podTemplateFile": "s3://'$S3BUCKET'/app_code/pod-template/executor-pod-template.yaml",
"spark.kubernetes.driver.limit.cores": "4.1",
"spark.kubernetes.executor.limit.cores": "4.3",
"spark.driver.memoryOverhead": "1000",
"spark.executor.memoryOverhead": "2G",
"spark.network.timeout": "2000s",
"spark.executor.heartbeatInterval": "300s",
"spark.kubernetes.node.selector.eks.amazonaws.com/nodegroup": "c59d"
}},
{
"classification": "spark-log4j",
"properties": {
"rootLogger.level" : "WARN"
}
}
],
"monitoringConfiguration": {
"s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
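Note that the script relies on `EMRCLUSTER_NAME` and `AWS_REGION` being exported before it runs (the exports near the top are commented out). A sketch of how the derived names fall out, with a hypothetical account ID:

```shell
# Hypothetical inputs -- the script derives everything else from these
EMRCLUSTER_NAME="emr-on-eks-rss"
ACCOUNTID="123456789012"
AWS_REGION="us-east-1"

# Derived the same way the script does
EMR_ROLE_ARN="arn:aws:iam::${ACCOUNTID}:role/${EMRCLUSTER_NAME}-execution-role"
S3BUCKET="${EMRCLUSTER_NAME}-${ACCOUNTID}-${AWS_REGION}"
echo "$EMR_ROLE_ARN"
echo "$S3BUCKET"
```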
53 changes: 0 additions & 53 deletions examples/emr6.6-benchmark_c5.sh

This file was deleted.

7 changes: 4 additions & 3 deletions provision.sh
@@ -9,7 +9,7 @@
export OSS_SPARK_SVCACCT_NAME=oss
export OSS_NAMESPACE=oss
export EMR_NAMESPACE=emr
-export EKS_VERSION=1.21
+export EKS_VERSION=1.26
export EMRCLUSTER_NAME=emr-on-$EKSCLUSTER_NAME
export ROLE_NAME=${EMRCLUSTER_NAME}-execution-role
export ACCOUNTID=$(aws sts get-caller-identity --query Account --output text)
@@ -193,7 +193,7 @@ autoDiscovery:
clusterName: $EKSCLUSTER_NAME
awsRegion: $AWS_REGION
image:
-tag: v1.21.1
+tag: v1.26.3
nodeSelector:
app: sparktest
podAnnotations:
@@ -213,7 +213,8 @@ helm install nodescaler autoscaler/cluster-autoscaler --namespace kube-system --

# Install Spark-Operator for the OSS Spark test
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
-helm install -n $OSS_NAMESPACE spark-operator spark-operator/spark-operator --version 1.1.6 \
+helm repo update
+helm install -n $OSS_NAMESPACE spark-operator spark-operator/spark-operator --version 1.1.27 \
--set serviceAccounts.spark.create=false --set metrics.enable=false --set webhook.enable=true --set webhook.port=443 --debug

echo "============================================================================="
