Vagrant template to provision a Spark Cluster
- See the `Vagrantfile` for details and to make changes.
- Spark runs as a standalone cluster. Tested with Spark 2.1.x (check the Spark Connector compatibility).
- One CentOS 7.4 head node machine and `N` worker (slave) machines [0-9].
- Vagrant
- VirtualBox
- Vagrant Hosts plugin: `vagrant plugin install vagrant-hosts`
  - This allows us to provision the hosts files for all the instances.
- Clone this repository.
- Download a pre-built Spark package and place it in this directory; symlink it as `spark.tgz`.
- Download the Spark Connector and place it in this directory.
- Open the `Vagrantfile` in a text editor.
- Optional: change `N_WORKERS` to the desired number of worker hosts [0-9].
- Feel free to make other changes, e.g. RAM and CPU for each of the machines.
- When you're ready, run `vagrant up` in the directory the `Vagrantfile` is in.
- By default, Vagrant will spin up one "head node" and `N` worker nodes in a Spark standalone cluster.
- You can start a standalone instance using `vagrant up hn0`.
- SSH in using `vagrant ssh hn0` or `vagrant ssh wn0`.
- Spark is running as `root`.
- The Spark WebUI should be available at http://192.168.99.200:8080.
- Verify the Spark master process with `sudo jps -ml`:

  ```
  sudo jps -ml
  PID org.apache.spark.deploy.master.Master --host 192.168.99.200 --port 7077 --webui-port 8080 -h 192.168.99.200
  ```
- Set up environment variables:

  ```
  GSC_JAR=$(ls /vagrant/greenplum-spark_2.11-*.jar)
  POSTGRES_JAR=$(ls /vagrant/postgresql-*.jar)
  ```
- Run Scala.

  Spark Connector 1.2: read from Greenplum with the Spark Connector, write to Greenplum with JDBC:

  ```
  sudo spark-shell --jars "${GSC_JAR},${POSTGRES_JAR}" --driver-class-path ${POSTGRES_JAR}
  ```
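  Once that shell is up, a session looks roughly like the sketch below. This is only an illustration: the host `gpmaster`, database `testdb`, credentials, and table names are placeholders, and the `greenplum` data source name follows the connector documentation (some early 1.x releases require the full provider class name instead).

  ```scala
  import org.apache.spark.sql.SaveMode

  // Placeholder connection details -- substitute your Greenplum master host,
  // database, credentials, and table names.
  val url = "jdbc:postgresql://gpmaster:5432/testdb"

  // Read a Greenplum table through the connector; partitionColumn controls
  // how the read is split across Spark tasks.
  val df = spark.read.format("greenplum")
    .option("url", url)
    .option("user", "gpadmin")
    .option("password", "changeme")
    .option("dbschema", "public")
    .option("dbtable", "battles")
    .option("partitionColumn", "id")
    .load()

  df.printSchema()

  // Connector 1.2 has no write path, so results go back over plain JDBC --
  // this is why POSTGRES_JAR is on the driver classpath.
  val props = new java.util.Properties()
  props.setProperty("user", "gpadmin")
  props.setProperty("password", "changeme")
  df.write.mode(SaveMode.Append).jdbc(url, "public.battles_copy", props)
  ```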
  Spark Connector 1.3+: read from and write to Greenplum with the Spark Connector:

  ```
  sudo spark-shell --jars "${GSC_JAR}"
  ```
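  With 1.3+ the read is the same, and the write can go through the connector as well, so the Postgres JDBC jar is no longer needed. Again a sketch, with the same placeholder connection details as above:

  ```scala
  import org.apache.spark.sql.SaveMode

  // Same placeholder connection details as the previous sketch.
  val url = "jdbc:postgresql://gpmaster:5432/testdb"

  val df = spark.read.format("greenplum")
    .option("url", url)
    .option("user", "gpadmin")
    .option("password", "changeme")
    .option("dbschema", "public")
    .option("dbtable", "battles")
    .option("partitionColumn", "id")
    .load()

  // Connector 1.3+ registers a write path, so the DataFrame can be saved
  // straight back to Greenplum through the same data source.
  df.write.format("greenplum")
    .option("url", url)
    .option("user", "gpadmin")
    .option("password", "changeme")
    .option("dbschema", "public")
    .option("dbtable", "battles_copy")
    .mode(SaveMode.Append)
    .save()
  ```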
- Shut down the cluster with `vagrant halt` and delete it with `vagrant destroy`.
- You can always run `vagrant up` again to turn the cluster back on or build a brand-new one.
See the LICENSE.txt file.