Running Shark Locally

This guide describes how to get Shark running locally. It sets up a small Hive installation on one machine and lets you run simple queries. The only prerequisite is that Java and Scala 2.9.3 are installed on your machine. If you don't have Scala 2.9.3, you can download it by running:

$ wget http://www.scala-lang.org/downloads/distrib/files/scala-2.9.3.tgz
$ tar xvfz scala-2.9.3.tgz

Download the binary distribution of Shark 0.7.0. The package contains two folders, shark-0.7.0 and hive-0.9.0-bin.

$ wget http://spark-project.org/download/shark-0.7.0-hadoop1-bin.tgz   # Hadoop 1/CDH3 - or -
$ wget http://spark-project.org/download/shark-0.7.0-hadoop2-bin.tgz   # Hadoop 2/CDH4

$ tar xvfz shark-0.7.0-*-bin.tgz
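
If you script the download, the choice between the two tarballs can be wrapped in a small function. The following is a minimal sketch, not part of Shark: the helper name shark_tarball_url and its single argument (the Hadoop major version, 1 or 2) are assumptions made for illustration. It only builds the URLs listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Shark): print the download URL of the
# Shark 0.7.0 binary tarball for a given Hadoop major version (1 or 2).
shark_tarball_url() {
    case "$1" in
        1|2)
            echo "http://spark-project.org/download/shark-0.7.0-hadoop$1-bin.tgz"
            ;;
        *)
            # Any other argument is rejected with a usage message on stderr.
            echo "usage: shark_tarball_url 1|2" >&2
            return 1
            ;;
    esac
}
```

With the function defined, the download step could be written as wget "$(shark_tarball_url 1)" for Hadoop 1/CDH3, or with 2 for Hadoop 2/CDH4.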

The Shark code is in the shark-0.7.0/ directory. To allow local execution, set the HIVE_HOME and SCALA_HOME environment variables in conf/shark-env.sh to point to the folders you just downloaded:

export HIVE_HOME=/path/to/hive-0.9.0-bin
export SCALA_HOME=/path/to/scala-2.9.3
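
A wrong path in either variable is an easy mistake to make here. The following is a minimal sketch of a pre-flight check, not something Shark provides: the function name check_home and its two arguments (the variable name and its value) are hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of Shark): verify that a variable
# such as HIVE_HOME or SCALA_HOME names an existing directory.
check_home() {
    name="$1"   # variable name, used only in messages
    path="$2"   # value the variable is set to
    if [ -z "$path" ]; then
        echo "$name is not set" >&2
        return 1
    fi
    if [ ! -d "$path" ]; then
        echo "$name points to a missing directory: $path" >&2
        return 1
    fi
    echo "$name ok: $path"
}
```

After sourcing conf/shark-env.sh in a shell, you could run check_home HIVE_HOME "$HIVE_HOME" and check_home SCALA_HOME "$SCALA_HOME" before starting Shark.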

Next, create the default Hive warehouse directory. This is where Hive will store table data for native tables.

$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod 0777 /user/hive/warehouse  # Or make your username the owner
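
To confirm that the warehouse directory is usable by your account, a check along the following lines may help. This is a sketch under assumptions: warehouse_ready is a hypothetical function name, and the default path is simply the one created above.

```shell
#!/bin/sh
# Hypothetical check (not part of Shark or Hive): succeed only if the given
# directory exists and the current user can write to it.
warehouse_ready() {
    dir="${1:-/user/hive/warehouse}"   # defaults to the path created above
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "warehouse ready: $dir"
    else
        echo "warehouse missing or not writable: $dir" >&2
        return 1
    fi
}
```

Calling warehouse_ready with no argument checks /user/hive/warehouse; a non-zero exit status suggests repeating the mkdir and chmod steps.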

You can now start the Shark CLI from the shark-0.7.0/ directory:

$ ./bin/shark

To verify that Shark is running, you can try the following example, which creates a table with sample data:

CREATE TABLE src(key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src;
SELECT COUNT(1) FROM src;
CREATE TABLE src_cached AS SELECT * FROM src;
SELECT COUNT(1) FROM src_cached;

The shark-0.7.0/bin directory contains several executables, including:

bin/shark: Runs the Shark CLI.
bin/shark-withinfo: Runs Shark with INFO-level logs printed to the console.
