Running Shark Locally
This guide describes how to get Shark running locally. It sets up a small Hive installation on a single machine so that you can execute simple queries. The only prerequisites are Java and Scala 2.9.2. If you don't have Scala 2.9.2, you can download and unpack it by running:
$ wget http://www.scala-lang.org/downloads/distrib/files/scala-2.9.2.tgz
$ tar xvfz scala-2.9.2.tgz
Download the binary distribution of Shark 0.2.1. The package contains two folders, shark-0.2.1 and hive-0.9.0-bin.
$ wget http://spark-project.org/download-shark-0.2.1-bin.tgz #for CDH4: download-shark-0.2.1-hadoop2-bin.tgz
$ tar xvfz shark-0.2.1-bin.tgz
The Shark code is in the shark-0.2.1/ directory. To allow local execution, set the HIVE_HOME and SCALA_HOME environment variables in conf/shark-env.sh to point to the folders you just downloaded:
export HIVE_HOME=/path/to/hive-0.9.0-bin
export SCALA_HOME=/path/to/scala-2.9.2
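Before starting Shark, it can help to confirm that both variables point at real directories. The following is a minimal sketch and not part of the Shark distribution: the function name `check_shark_env` is hypothetical, and it assumes HIVE_HOME and SCALA_HOME have already been set in the current shell.

```shell
# Hypothetical helper, not shipped with Shark: checks that HIVE_HOME and
# SCALA_HOME are set and point to existing directories.
check_shark_env() {
  status=0
  for name in HIVE_HOME SCALA_HOME; do
    # Read the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      status=1
    elif [ ! -d "$value" ]; then
      echo "$name points to a missing directory: $value"
      status=1
    else
      echo "$name OK: $value"
    fi
  done
  # Non-zero exit status if either variable failed the check.
  return "$status"
}
```

Running `check_shark_env` after setting the variables prints one line per variable and returns a non-zero status if either one is unset or points to a missing directory.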
Next, create the default Hive warehouse directory. This is where Hive will store table data for native tables.
$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod 0777 /user/hive/warehouse # Or make your username the owner
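If table creation later fails with a permissions error, the warehouse directory is a likely place to look. As a rough sketch, the check below reports whether the current user can write to it. The function name `warehouse_writable` is hypothetical, and the default path is simply the one created above.

```shell
# Hypothetical helper: succeeds only if the directory exists and the
# current user can write to it. Defaults to the path used in this guide.
warehouse_writable() {
  dir="${1:-/user/hive/warehouse}"
  [ -d "$dir" ] && [ -w "$dir" ]
}
```

For example, `warehouse_writable && echo "warehouse is writable"` prints the message only when the directory exists and is writable.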
You can now start the Shark CLI:
$ ./bin/shark
To verify that Shark is running, you can try the following example, which creates a table with sample data:
CREATE TABLE src(key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src;
SELECT COUNT(1) FROM src;
CREATE TABLE src_cached AS SELECT * FROM SRC;
SELECT COUNT(1) FROM src_cached;
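The `src` table above is created without a ROW FORMAT clause, so Hive reads it with its default text layout: one row per line, with fields separated by the Ctrl-A character (octal 001). To try the same queries on your own data, a small file in that layout can be generated as sketched below. The function name `make_sample_kv` and the three rows are invented for illustration and are not taken from kv1.txt.

```shell
# Hypothetical helper: writes a tiny two-column file (key, value) using
# Hive's default field delimiter, Ctrl-A (\001), to the path given as $1.
make_sample_kv() {
  out="$1"
  : > "$out"                       # create or truncate the output file
  for key in 1 2 3; do
    # Each row is "<key>^A<value>" followed by a newline.
    printf '%s\001%s\n' "$key" "val_$key" >> "$out"
  done
}
```

A file written this way, for example with `make_sample_kv /tmp/my_kv.txt`, can then be used as the path in the `LOAD DATA LOCAL INPATH` statement in place of kv1.txt.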
In addition to the Shark CLI, there are several executables in shark-0.2.1/bin:
- bin/shark: Runs Shark CLI.
- bin/shark-withinfo: Runs Shark with INFO level logs printed to the console.