The artifacts included in this setup will help you deploy a standalone Spark cluster using MapR's packages. Each node in the cluster is configured as a MapR client. There are no data or scripts included with the setup (yet). Once the cluster is up and running, you can log into one of the nodes and run spark-shell or spark-submit with access to maprfs.
Each Spark node will be assigned a location by Kubernetes, based on the resources requested and on what is available in your Kubernetes environment. Because we are using a custom Docker image, a local registry must be available for Kubernetes to pull the image from. Every node in the Kubernetes cluster has to be able to reach the MapR cluster.
There are some complexities to Kubernetes networking that I will not address in this document. Make sure you select a Pod network add-on that supports communication and DNS resolution between pods/nodes (I am using Flannel).
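As a quick sanity check (optional; busybox:1.28 and the placeholder pod IP below are just examples), you can verify DNS resolution and pod-to-pod connectivity from a throwaway pod:
# Verify cluster DNS resolution from inside a temporary pod
kubectl run -it --rm dns-check --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default
# Verify pod-to-pod connectivity (take <pod-ip> from 'kubectl get pods -o wide')
kubectl run -it --rm ping-check --image=busybox:1.28 --restart=Never -- ping -c 3 <pod-ip>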
- Kubernetes installed and working.
- A Docker registry that is accessible from all Kubernetes nodes (a minimal example follows this list).
- The artifacts downloaded to the Kubernetes master node.
- A running Kubernetes Dashboard with connectivity tested from a browser.
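If you do not have a registry yet, a minimal sketch is to run the official registry image on a host every Kubernetes node can reach (port 5000 is just the conventional default; an insecure registry also needs to be whitelisted in each node's Docker daemon insecure-registries configuration):
# Run a local Docker registry (official registry:2 image) on port 5000
docker run -d -p 5000:5000 --restart=always --name registry registry:2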
Download and extract the artifacts.
Under the docker directory you will find the Dockerfile. Create the image following these steps.
- cd docker
- docker build -t mapr-spark:latest .
- You can name the image whatever you want
It will take a few minutes to build.
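Once the build finishes, you can confirm the image exists locally:
# List the locally built image
docker images mapr-spark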
Tag and push to your registry
- docker tag mapr-spark:latest <registry>:<port>/mapr-spark:latest
- docker push <registry>:<port>/mapr-spark:latest
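To confirm the push worked and that other nodes can pull the image (same registry host/port placeholders as above):
# The registry catalog should now list mapr-spark
curl http://<registry>:<port>/v2/_catalog
# From any other Kubernetes node, confirm the image can be pulled
docker pull <registry>:<port>/mapr-spark:latest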
To customize the configuration, modify the variables in setup.sh. The file is self-documented. At a high level, this is what you need to configure:
- The name of your Docker image
- Information about the MapR cluster to connect to
- Ticket, ssl_truststore, CLDB…
Run setup.sh to generate the YAML files.
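A purely illustrative sketch of this step (the real variable names and their documentation live in setup.sh itself):
# Edit the self-documented variables: Docker image name, MapR cluster name/CLDB hosts,
# and the locations of the MapR ticket and ssl_truststore
vi setup.sh
# Generate the YAML files used in the next steps (spark_master.yaml and spark_worker.yaml)
./setup.sh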
This configuration is meant for exploring Kubernetes functionality along with MapR packages. Take the time to start the components one at a time and make sure each service is operational before moving on.
NOTE: I append a random string to the pod names to facilitate the creation of multiple clusters with the same framework.
Deploy the master
kubectl apply -f spark_master.yaml
Confirm the pod/container is running (note the name of the pod)
kubectl --namespace=spark-cluster get pods
NAME READY STATUS RESTARTS AGE
**spark-mstr-bydkxg-rc-jmbhq** 1/1 Running 0 1h
Open a shell in the master pod
kubectl --namespace=spark-cluster exec -it spark-mstr-bydkxg-rc-jmbhq /bin/bash
Connect to the Spark UI
http://localhost:8001/api/v1/namespaces/spark-cluster/services/spark-mstr-bydkxg-rc-jmbhq:8080/proxy/
NOTE:
localhost assumes you have tunneled to the Kubernetes master node (a minimal sketch follows below).
Make sure you use the correct pod name.
You should get a response from the master, but no worker nodes.
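One way to set that up, assuming you work from a remote workstation and kubectl is configured on the Kubernetes master node (8001 is the default kubectl proxy port; the user and host names are placeholders):
# On the Kubernetes master node: start the API proxy (binds to 127.0.0.1:8001 by default)
kubectl proxy
# On your workstation: forward local port 8001 through SSH to the proxy on the master node
ssh -L 8001:localhost:8001 <user>@<k8s-master-node>
# The Spark UI is then reachable in your browser at the proxy URL shown above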
Deploy the workers (the default config deploys 2 workers)
kubectl apply -f spark_worker.yaml
Confirm the pods/containers are running
kubectl --namespace=spark-cluster get pods
NAME READY STATUS RESTARTS AGE
spark-mstr-jkchzi-rc-npvmd 1/1 Running 0 2m
spark-wrkr-jkchzi-rc-488nf 1/1 Running 0 13s
spark-wrkr-jkchzi-rc-r7k2w 1/1 Running 0 13s
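You can also check a worker's log to confirm it registered with the master (worker pod name taken from the listing above; the exact log message may vary by Spark version):
# Look for the worker's registration message in its log
kubectl --namespace=spark-cluster logs spark-wrkr-jkchzi-rc-488nf | grep -i registered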
Connect to the Spark UI
After a minute or so, you should see the worker nodes.
Open a shell on the master pod and verify access to maprfs:
kubectl --namespace=spark-cluster exec -it spark-mstr-jkchzi-rc-npvmd /bin/bash
root@spark-mstr-jkchzi-rc-npvmd:/# /opt/mapr/hadoop/hadoop-2.7.0/bin/hdfs dfs -ls /
Found 9 items
drwxr-xr-x - mapr 5000 4 2019-01-09 21:24 /apps
drwxr-xr-x - mapr 5000 0 2018-12-06 21:24 /hbase
drwxr-xr-x - mapr 5000 0 2018-12-06 21:36 /opt
drwxr-xr-x - mapr 5000 3 2019-01-11 18:12 /podvols
drwxr-xr-x - mapr 5000 2 2019-01-09 18:37 /pv
drwxr-xr-x - root root 0 2018-12-06 21:36 /tables
drwxrwxrwx - mapr 5000 0 2018-12-06 21:24 /tmp
drwxr-xr-x - root root 3 2018-12-07 01:12 /user
drwxr-xr-x - mapr 5000 1 2018-12-06 21:24 /var
root@spark-mstr-jkchzi-rc-npvmd:/# /opt/mapr/spark/spark-2.3.1/bin/spark-shell
Warning: Unable to determine $DRILL_HOME
2019-01-16 22:19:47 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://spark-mstr-jkchzi-rc-npvmd:4040
Spark context available as 'sc' (master = local[*], app id = local-1547677196049).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1-mapr-1808
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sqlcontext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1b1ea1d9
scala> val df = sqlcontext.read.json("/podvols/iotdata.json")
....
df: org.apache.spark.sql.DataFrame = [device: string, guid: string ... 7 more fields]
scala> df.show(1)
+------+--------------------+------+------+------+--------------+----+-------------------+--------+
|device| guid|metric|millis|sensor| senstype|site| timestamp|timezone|
+------+--------------------+------+------+------+--------------+----+-------------------+--------+
| dev1|6f3743c6-01e1-4fb...|1000.0| 588| sen/2|currentPresure| S1|2019-01-11T10:04:03| -0800|
+------+--------------------+------+------+------+--------------+----+-------------------+--------+
only showing top 1 row
scala>
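To run a job against the standalone cluster instead of the local master, you can use spark-submit from inside the master pod. A hedged sketch, assuming the default standalone port 7077 and the standard location of the Spark examples jar in the MapR package:
# Submit the SparkPi example to the standalone master from inside the master pod
/opt/mapr/spark/spark-2.3.1/bin/spark-submit \
  --master spark://spark-mstr-jkchzi-rc-npvmd:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/mapr/spark/spark-2.3.1/examples/jars/spark-examples_*.jar 100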
The easiest way to remove all artifacts is to run
kubectl delete namespace spark-cluster
Use your own namespace here if you changed it in the setup.
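If you prefer to keep the namespace and only remove the Spark resources, deleting the generated YAML files directly also works (assuming the default file names produced by setup.sh):
# Remove only the resources created from the generated YAML files
kubectl --namespace=spark-cluster delete -f spark_worker.yaml -f spark_master.yaml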
Do you need more power? Add more workers…
Get the name of the worker replication controller. In this example it has 2 replicas running (2 Spark worker nodes)
**kubectl --namespace=spark-cluster get rc**
NAME DESIRED CURRENT READY AGE
spark-mstr-jkchzi-rc 1 1 1 11m
**spark-wrkr-jkchzi-rc** 2 2 2 2m
Add a replica (one more worker node).
**kubectl --namespace=spark-cluster scale --current-replicas=2 --replicas=3 rc/spark-wrkr-jkchzi-rc**
replicationcontroller "spark-wrkr-jkchzi-rc" scaled
You should now see one more worker node
kubectl --namespace=spark-cluster get rc
NAME DESIRED CURRENT READY AGE
spark-mstr-jkchzi-rc 1 1 1 18m
spark-wrkr-jkchzi-rc 3 3 3 9m
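Scaling back down works the same way, for example:
# Drop back to 2 workers by lowering the replica count
kubectl --namespace=spark-cluster scale --replicas=2 rc/spark-wrkr-jkchzi-rc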
References
Install cluster with Kubeadm
https://kubernetes.io/docs/setup/independent/install-kubeadm/
https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/