Running Shark on a Cluster
This guide describes how to get Shark up and running on a cluster. If you are interested in using Shark on Amazon EC2, see Running Shark on EC2, which uses a set of EC2 scripts to launch a pre-configured cluster in a few minutes.
NOTE: Shark is a drop-in tool that can be used on top of existing Hive warehouses. It does not require that you change your existing Hive deployment in any way.
Running Shark on a cluster requires the following external components:
- Scala 2.9.3
- Spark 0.8.0
- A compatible Java Runtime: OpenJDK 7, Oracle HotSpot JDK 7, or Oracle HotSpot JDK 6u23+
- The Shark-specific Hive JAR (based on Hive 0.9), included in the Shark binary distribution
- An HDFS cluster (setup is not covered in this guide)
Note that unlike earlier versions of Spark and Shark, running the latest version on a cluster no longer requires Apache Mesos.
If you don't have Scala 2.9.3 installed on your system, you can download and unpack it with:
$ wget http://www.scala-lang.org/files/archive/scala-2.9.3.tgz
$ tar xvfz scala-2.9.3.tgz
This guide uses Spark's standalone deployment mode to run Shark on a cluster. See the Spark standalone mode documentation for more information.
Download Spark:
$ wget http://spark-project.org/download/spark-0.8.0-incubating-bin-hadoop1.tgz # Hadoop 1/CDH3 - or -
$ wget http://spark-project.org/download/spark-0.8.0-incubating-bin-cdh4.tgz # Hadoop 2/CDH4
$ tar xvfz spark-*-bin-*.tgz
$ mv spark-*-bin-*/ spark-0.8.0
Edit spark-0.8.0/conf/slaves to add the hostname of each slave, one per line.
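Because later steps read this file line by line, it can help to print the hosts it actually contains before going further. The following is a minimal sketch, not part of Spark: the function name `list_slaves` is hypothetical, and it assumes that blank lines and `#` comments in the file should be ignored.

```shell
# list_slaves FILE: print one hostname per line from a slaves file.
# Assumption: text after '#' is a comment and blank lines carry no host.
list_slaves() {
    sed -e 's/#.*$//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' "$1" | grep -v '^$'
}

# Example (path as used in this guide):
#   list_slaves spark-0.8.0/conf/slaves
```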
Edit spark-0.8.0/conf/spark-env.sh to set SCALA_HOME and SPARK_WORKER_MEMORY:
export SCALA_HOME=/path/to/scala-2.9.3
export SPARK_WORKER_MEMORY=16g
SPARK_WORKER_MEMORY is the maximum amount of memory that Spark can use on each node. Increasing this allows more data to be cached, but be sure to leave memory (e.g. 1 GB) for the OS and any other services that the node may be running.
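One way to pick the value is to subtract a fixed reserve from the node's total memory. The sketch below illustrates that arithmetic only; the function name `worker_memory_mb` and the 1024 MB default reserve are assumptions taken from the "1 GB" example above, not Spark settings.

```shell
# worker_memory_mb TOTAL_MB [RESERVE_MB]: print a candidate value for
# SPARK_WORKER_MEMORY in megabytes, leaving RESERVE_MB (default 1024,
# an assumed figure) for the OS and other services on the node.
worker_memory_mb() {
    total_mb=$1
    reserve_mb=${2:-1024}
    if [ "$total_mb" -le "$reserve_mb" ]; then
        echo "not enough memory: ${total_mb} MB total" >&2
        return 1
    fi
    echo "$(( total_mb - reserve_mb ))m"
}

# Example, for a node with 17408 MB of RAM:
#   export SPARK_WORKER_MEMORY=$(worker_memory_mb 17408)
```

If the node runs other memory-hungry services, pass a larger reserve as the second argument.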
Download the binary distribution of Shark 0.8. The package contains two folders, shark-0.8.0 and hive-0.9.0-shark-0.8.0-bin.
$ wget https://github.com/amplab/shark/releases/download/v0.8.0/shark-0.8.0-bin-hadoop1.tgz # Hadoop 1/CDH3 - or -
$ wget https://github.com/amplab/shark/releases/download/v0.8.0/shark-0.8.0-bin-cdh4.tgz # Hadoop 2/CDH4
$ tar xvfz shark-*-bin-*.tgz
$ cd shark-*-bin-*
Now edit shark-0.8.0/conf/shark-env.sh to set the HIVE_HOME, SCALA_HOME and MASTER environment variables. The master URI must exactly match the spark:// URI shown at port 8080 of the standalone master.
export HADOOP_HOME=/path/to/hadoop
export HIVE_HOME=/path/to/hive-0.9.0-shark-0.8.0-bin
export MASTER=<Master URI>
export SPARK_HOME=/path/to/spark
export SPARK_MEM=16g
source $SPARK_HOME/conf/spark-env.sh
The last line is there to avoid setting SCALA_HOME in two places. Make sure SPARK_MEM is not larger than SPARK_WORKER_MEMORY set in the previous section.
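The memory constraint can be checked mechanically before starting anything. This is a minimal sketch under stated assumptions: the helper names `to_mb` and `mem_fits` are hypothetical, and only sizes with a `g` or `m` suffix (as in the examples in this guide) are handled.

```shell
# to_mb SIZE: convert a memory string such as "16g" or "512m" to megabytes.
# Assumption: only the g/G and m/M suffixes are supported here.
to_mb() {
    case "$1" in
        *[gG]) echo $(( ${1%[gG]} * 1024 )) ;;
        *[mM]) echo "${1%[mM]}" ;;
        *) echo "unsupported size: $1" >&2; return 1 ;;
    esac
}

# mem_fits SPARK_MEM SPARK_WORKER_MEMORY: succeed if the first value
# is not larger than the second.
mem_fits() {
    mem=$(to_mb "$1") || return 1
    worker=$(to_mb "$2") || return 1
    [ "$mem" -le "$worker" ]
}

# Example, placed at the end of shark-env.sh:
#   mem_fits "$SPARK_MEM" "$SPARK_WORKER_MEMORY" ||
#       echo "SPARK_MEM exceeds SPARK_WORKER_MEMORY" >&2
```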
If you are using Shark on an existing Hive installation, be sure to set HIVE_CONF_DIR (in shark-env.sh) to a folder containing your configuration files. Alternatively, copy your Hive XML configuration files into Shark's hive-0.9.0-bin/conf. For example:
cp /etc/hive/conf/*.xml /path/to/hive-0.9.0-bin/conf/
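For the HIVE_CONF_DIR route, a quick sanity check is to confirm that the chosen folder really holds a hive-site.xml. The helper below is a hypothetical sketch (the name `has_hive_site` is not part of Shark or Hive), and /etc/hive/conf is simply the example path used above.

```shell
# has_hive_site DIR: succeed if DIR contains a hive-site.xml file.
has_hive_site() {
    [ -f "$1/hive-site.xml" ]
}

# Example lines for shark-env.sh (the path is an assumption):
#   export HIVE_CONF_DIR=/etc/hive/conf
#   has_hive_site "$HIVE_CONF_DIR" ||
#       echo "no hive-site.xml in $HIVE_CONF_DIR" >&2
```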
Copy the Spark and Shark directories to slaves. We assume that the user on the master can SSH to the slaves. For example:
$ while read slave_host; do
>   rsync -Pav spark-0.8.0 shark-0.8.0 "$slave_host:"
> done < /path/to/spark/conf/slaves
Launch the cluster by running the Spark cluster launch scripts:
$ cd spark-0.8.0
$ ./bin/start-all.sh
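Once the master is up, its web UI on port 8080 shows the spark:// URI that MASTER in shark-env.sh must match. A rough format check can catch common mistakes such as pasting the web UI address instead. This is a sketch only: the function name `is_spark_master_uri` is hypothetical, and the pattern assumes the usual `spark://HOST:PORT` shape.

```shell
# is_spark_master_uri URI: succeed if URI looks like spark://HOST:PORT.
# This checks the shape only; it does not contact the master.
is_spark_master_uri() {
    printf '%s\n' "$1" | grep -Eq '^spark://[^:/[:space:]]+:[0-9]+$'
}

# Example, after sourcing shark-env.sh:
#   is_spark_master_uri "$MASTER" || echo "MASTER looks wrong: $MASTER" >&2
```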
The newest versions of Hadoop require additional configuration options. You may need to set the following values inside of Hive's configuration file (hive-site.xml):
- fs.default.name: Should point to the URI of your HDFS namenode, e.g. hdfs://myNameNode:8020/
- fs.defaultFS: Should be equal to fs.default.name
- mapred.job.tracker: Should list the host:port of your JobTracker or be set to "NONE" if you are only using Spark. Note that this needs to be explicitly set even if you aren't using a JobTracker.
- mapreduce.framework.name: Should be set to a non-empty string, e.g. "NONE".
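The four settings above can be written out as standard Hadoop-style property elements. The following sketch prints them for a Spark-only cluster; the function names `hive_property` and `spark_only_properties` are hypothetical, and the values are not XML-escaped, so it is only suitable for simple values like the ones shown.

```shell
# hive_property NAME VALUE: print one <property> element for hive-site.xml.
hive_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# spark_only_properties NAMENODE_URI: print the four settings listed above,
# using "NONE" for the JobTracker-related entries (no JobTracker in use).
spark_only_properties() {
    hive_property fs.default.name "$1"
    hive_property fs.defaultFS "$1"
    hive_property mapred.job.tracker NONE
    hive_property mapreduce.framework.name NONE
}

# Example:
#   spark_only_properties hdfs://myNameNode:8020/
```

The output is meant to be pasted inside the `<configuration>` element of hive-site.xml.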
You can now launch Shark by running the following command from the shark-0.8.0 directory:
$ ./bin/shark-withinfo
More detailed information on the Spark standalone scripts and options is available in the Spark standalone mode documentation.
To verify that Shark is running, you can try the following example, which creates a table with sample data:
CREATE TABLE src(key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src;
SELECT COUNT(1) FROM src;
CREATE TABLE src_cached AS SELECT * FROM src;
SELECT COUNT(1) FROM src_cached;
See the Shark User Guide for more details on using Shark.