Skip to content
pwendell edited this page Oct 10, 2012 · 22 revisions

Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can answer Hive QL queries up to 30 times faster than Hive without modification to the existing data nor queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions.

The Shark Query Language

Shark supports a subset of SQL nearly identical to that implemented by Hive. This guide assumes you have some familiarity with Hive, and focuses on the extra functionality included in Shark. Those who need a refresher can refer to the Hive Documentation

Unlike Hive, Shark allows users to exploit this temporal locality by caching their working set of data, or in database terms, to create in-memory materialized views. Common data types can be cached in a columnar format (as Java primitives arrays), which is very efficient for storage and garbage collection, yet provides maximum performance (orders of magnitude faster than reading data from disk). To create a cached table in Spark, simply make a new table ending in "_cached". Below is an example:

CREATE TABLE logs_last_month_cached AS
SELECT * FROM logs WHERE time > date(...);

Once this table has been created, we can query it like any other Hive table.

SELECT page, count(*) c FROM logs_last_month_cached
GROUP BY page ORDER BY c DESC LIMIT 10;

Cached tables are ephemeral - they will fall out of cache once a user's session is terminated. To create long-lived cached tables, use the SharkServer described below.

In addition to caching, Shark employs a number of optimization techniques such as limit push downs and hash-based shuffle, which can provide significant speedups in query processing.

Executing Queries

Shark CLI

The easiest way to run Shark is to start a Shark Command Line Client (CLI) and being executing queries. The Shark CLI connects directly to the Hive Metastore, so it is compatible with existing Hive deployments. Shark executables are available in the bin/ directory. To start the Shark CLI, simply run:

$ ./bin/shark
shark> 

The Shark CLI will only work correctly if the HIVE_HOME environment variable is set (see Configuration).

SharkServer

It is also possible to run a persistent SharkServer which retains session state, such as cached tables, across executions of the Shark CLI. This can be very useful for those hoping to keep a few tables in cache for long periods of time.

To start a SharkServer, run

$./bin/shark --service sharkserver <port>

A client can connect to the server using

$./bin/shark -h <server-host> -p <server-port>

Configuration Options

Shark has two types of configuration options. The first are options that related to a given Shark session. These can be set either in the CLI itself (using set key=val command) or in your existing Hive XML configuration file.

shark.exec.mode     # 'hive' or 'spark', whether to use Hive or Shark based execution
shark.explain.mode  # 'hive' or 'shark', whether to use Hive or Shark based explain
shark.cache.flag.checkTableName # 'true' or 'false', whether to cache tables ending in "_cached"

Advanced users can find the full list of configuration options in src/main/scala/shark/SharkConfVars.scala/

The second type of configuration varialbes are environment vars that must be set for the Shark driver and slaves to run correctly. These are specified in conf/shark-env.sh. A few of the more important ones are described here:

HIVE_HOME     # Path to directory containing patched Hive jars
HIVE_CONF_DIR # Optional, a different path containing Hive configuration files 
SPARK_MEM     # How many much memory to allocate for slaves (e.g '1500m', '5g')