Developer Guide
create table src(key int, value string);
LOAD DATA LOCAL INPATH '${env:HIVE_DEV_HOME}/data/files/kv1.txt' INTO TABLE src;
create table src1(key int, value string);
LOAD DATA LOCAL INPATH '${env:HIVE_DEV_HOME}/data/files/kv3.txt' INTO TABLE src1;
Note that you may have to create a /user/hive/warehouse/src
path before executing these commands.
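If the warehouse path is missing, it can be created with the Hadoop filesystem shell. This is a sketch; the exact path comes from the commands above, and the permissive chmod is an assumption for a local development setup:

```shell
# Create the warehouse directory in HDFS (path taken from the LOAD DATA commands above).
hadoop fs -mkdir /user/hive/warehouse/src
# Assumption: wide-open permissions are acceptable for a local dev environment.
hadoop fs -chmod -R 777 /user/hive/warehouse
```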
To get up and running quickly with the Shark master branch built against Spark master, Hive, and Hadoop, check out the bin/dev/run-tests-from-scratch
script, which comes as part of the Shark repository. This script automatically downloads all of Shark's dependencies for you (except for Java). It was developed in part to aid in the automated testing of Shark, but also aims to be a useful reference for new developers setting up Shark in their local development environment. Run the script with the -h
flag to see all options, and specifically check out the -t
flag, which skips building and running the test suites while still setting up Shark's dependencies and building Shark.
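For example, from the root of the Shark checkout (flag meanings as described above):

```shell
# Print all available options for the bootstrap script.
./bin/dev/run-tests-from-scratch -h

# Download dependencies and build Shark, but skip building/running the test suites.
./bin/dev/run-tests-from-scratch -t
```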
Get the latest version of Shark.
$ git clone https://github.com/amplab/shark.git
Development of Shark (running tests or using Eclipse) requires the (patched) development package of Hive. Clone it from GitHub and package it:
$ git clone https://github.com/amplab/hive.git -b shark-0.9
$ cd hive
$ ant package
ant package
builds all the Hive jars and puts them into the build/dist
directory.
NOTE: On the EC2 AMI, you may have to first install ant-antlr.noarch
and ant-contrib.noarch
:
$ yum install ant-antlr.noarch
$ yum install ant-contrib.noarch
If you are trying to build Hive on your local machine and (a) your distribution doesn't have yum, or (b) the above yum commands don't work out of the box with your distro, then you probably want to upgrade to a newer version of ant; ant >= 1.8.2 should work. Download ant binaries at http://ant.apache.org/bindownload.cgi. You might also be able to upgrade ant using a package manager; however, on older versions of CentOS (e.g., 6.4), yum cannot install ant 1.8 out of the box, so installing ant from the binary download is recommended.
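A manual install might look like the following sketch. The version number and download URL are assumptions; check http://ant.apache.org/bindownload.cgi for the current release and adjust the paths to your system:

```shell
# Assumed version and mirror; substitute the current release from ant.apache.org.
wget https://archive.apache.org/dist/ant/binaries/apache-ant-1.8.4-bin.tar.gz
tar -xzf apache-ant-1.8.4-bin.tar.gz -C /opt

# Put the new ant first on the PATH.
export ANT_HOME=/opt/apache-ant-1.8.4
export PATH="$ANT_HOME/bin:$PATH"

# Verify the version that will be used for the Hive build.
ant -version
```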
Edit shark/conf/shark-env.sh
and set the following for running local mode:
#!/usr/bin/env bash
export SHARK_MASTER_MEM=1g
export HIVE_DEV_HOME="/path/to/hive"
export HIVE_HOME="$HIVE_DEV_HOME/build/dist"
SPARK_JAVA_OPTS="-Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
export SCALA_VERSION=2.9.3
export SCALA_HOME="/path/to/scala-home-2.9.3"
export JAVA_HOME="/path/to/java-home-1.7_21-or-newer"
# Required only for distributed mode:
export SPARK_HOME="/path/to/spark"
export HADOOP_HOME="/path/to/hadoop-0.20.205.0"
Then you will need to generate Hive's CLI test harness for the test code to work. First generate Hive's TestCliDriver script:
$ cd $HIVE_HOME
$ ant package
$ ant test -Dtestcase=TestCliDriver
Once the JUnit tests start running, you can stop (Ctrl-C) the Hive test execution (Shark reuses test classes from Hive, which are compiled in this step). Then you can run sbt/sbt test:compile
inside $SHARK_HOME.
Shark includes two types of unit tests: Scala unit tests and Hive CLI tests.
You can run the Scala unit tests by invoking the test command in sbt:
$ sbt/sbt test
These tests are defined in src/test/scala/shark.
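To run a single suite rather than the whole set, sbt's test-only task takes a class name. SomeSuite below is a placeholder; substitute a suite class from src/test/scala/shark:

```shell
# Run one suite by its fully qualified class name (SomeSuite is a placeholder).
$ sbt/sbt "test-only shark.SomeSuite"
```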
To run Hive's test suite, first generate Hive's TestCliDriver script.
$ ant package
$ ant test -Dtestcase=TestCliDriver
The above command generates the Hive test Java files from Velocity templates, and then starts executing the tests. You can stop once the tests start running.
Then compile our tests:
$ sbt/sbt test:compile
Then run the tests with:
$ TEST=regex_pattern ./bin/dev/test
You can control which tests to run by changing the TEST environment variable. If specified, only tests that match the TEST regex will be run. You can also specify a whitelist of tests to run using TEST_FILE. For example, to run our regression tests, you can do
$ TEST_FILE=src/test/tests_pass.txt ./bin/dev/test
You can also combine both TEST and TEST_FILE, in which case only tests that satisfy both filters will be executed.
An example:
# Run only tests that begin with "union" or "input"
$ TEST="testCliDriver_(union|input)" TEST_FILE=src/test/tests_pass.txt ./bin/dev/test 2>&1 | tee test.log
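The TEST value is an ordinary regular expression matched against test names. As a quick illustration of how such a filter behaves (the test names below are made up for the example):

```shell
# Made-up test names; grep -E applies the same kind of regex matching the TEST filter uses.
printf 'testCliDriver_union_ppr\ntestCliDriver_input1\ntestCliDriver_join1\n' |
  grep -E 'testCliDriver_(union|input)'
```

Only the union and input entries pass the filter; the join entry is excluded.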
We use a combination of Vim/Emacs/Sublime Text 2 and Eclipse to develop Shark. It is often handy to use Eclipse when you need to cross-reference a lot to understand the code or to run the debugger. Since Shark is written in Scala, you will need the Scala IDE plugin for Eclipse.
- Download Eclipse Indigo 3.7 (Eclipse IDE for Java Developers) from http://www.eclipse.org/downloads/
- Install the Scala IDE for Eclipse plugin. See http://scala-ide.org/download/current.html
To generate the Eclipse project files, do
$ sbt/sbt eclipse
Once you run the above command, you will be able to open the Scala project in Eclipse. Note that Eclipse is often buggy and the compilers/parsers can crash while editing your file.
We recommend you turn Eclipse's auto build off, and use sbt's continuous compilation mode to build the project.
$ sbt/sbt
> ~ package
To run Shark in Eclipse, set up a Scala application run configuration for the shark.SharkCliDriver class. You will need to set the JVM parameters to increase the default heap allocation (e.g., -Xms512m -Xmx512m
), since the default heap is too small to run Shark.
To set up the Hive project for Eclipse, follow https://cwiki.apache.org/confluence/display/Hive/GettingStarted+EclipseSetup
- Delete /root/.ssh/, /home/ec2user/.ssh/, and /root/.bash_history
- Make sure to add 4 ephemeral volumes to the AMI