-
Notifications
You must be signed in to change notification settings - Fork 327
Running Shark on EC2
Shark can be launched on EC2 using the Spark EC2 scripts that come with Spark. These scripts let you launch, pause and destroy clusters that come automatically configured with HDFS, Spark, Apache Mesos, and Shark.
To run a Shark cluster on EC2, first sign up for an Amazon EC2 account on the Amazon Web Services site. Then, download Spark to your local machine:
$ wget http://spark-project.org/files/spark-0.7.2-prebuilt-hadoop1.tgz
$ tar xvfz spark-0.7.2-prebuilt-hadoop1.tgz
The ec2
directory contains the scripts to set up a cluster. Detailed instructions are available in the Spark EC2 guide. In a nutshell, you will need to do:
$ spark-0.7.2/ec2/spark-ec2 -k <keypair-name> -i <key-file> -s <num-slaves> launch <cluster-name>
Where <keypair>
is the name of your EC2 key pair (that you gave it when you created it), <key-file>
is the private key file for your key pair, <num-slaves>
is the number of slave nodes to launch (try 1 at first), and <cluster-name>
is the name to give to your cluster. This creates a cluster on EC2 using a pre-built machine image that has both Spark and Shark.
Login to the master using spark-ec2 login
:
$ ./spark-ec2 -k key -i key.pem login <cluster-name>
Then, launch Shark by going into the shark
directory:
$ shark-0.2/bin/shark-withinfo
The "withinfo" script prints INFO level log messages to the console. If you prefer, you can also leave these out by running ./bin/shark
.
You can use Hive's CREATE EXTERNAL TABLE
command to access data in a directory in S3. First, configure your S3 credentials by adding the following properties into ~/ephemeral-hdfs/conf/core-site.xml
:
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>SECRET</value>
</property>
Then create an S3-backed table in bin/shark
as described in the Hive S3 guide:
shark> CREATE EXTERNAL TABLE table_name (col1 type1, col2 type2, ...) <storage info> LOCATION 's3n://bucket/directory/';
spark-ec2
automatically sets up two HDFS file systems: ephemeral-hdfs
, which uses the ephemeral disks attached to your VMs that go away when a VM is stopped, and persistent-hdfs
, which is EBS-backed and persists across pausing and starting the same cluster. By default, Shark stores its tables in ephemeral-hdfs
, which provides a lot of space and is excellent for temporary tables, but is not meant for long-term storage. You can change HADOOP_HOME
in conf/shark-env.sh
to change this, or explicitly upload data to S3 or to the persistent-hdfs
.
Like Hive, Shark stores its tables in /user/hive/warehouse
on the HDFS instance it's configured with. You can either create a table there with CREATE TABLE
and upload data into /user/hive/warehouse/<table_name>
, or load elsewhere in HDFS and use CREATE EXTERNAL TABLE
.
To make it easy to try Shark, we've made available both a small and a large dump of Wikipedia collected by Freebase. They are available in S3 directories spark-data/wikipedia-sample
(40 MB) and spark-data/wikipedia-2010-09-12
(50 GB). Both are stored as tab-separated files containing one record for each article in Wikipedia, with five fields: article ID, title, date modified, XML, and plain text.
Let's first create an external table for the smaller sample dataset:
shark> CREATE EXTERNAL TABLE wiki_small (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-sample/';
Now we can query it as follows:
shark> SELECT COUNT(1) FROM wiki_small WHERE TEXT LIKE '%Berkeley%';
shark> SELECT title FROM wiki_small WHERE TEXT LIKE '%Berkeley%';
We can also cache the table in memory by using the CREATE TABLE AS SELECT
statement with "shark.cache" enabled in the table properties:
shark> CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki_small;
And then query the cached data for faster access:
shark> SELECT COUNT(1) FROM wiki_small_in_mem WHERE TEXT LIKE '%Berkeley%';
Or, we can cache just a subset of the table, such as just two of the columns (or any other SQL expression we wish):
shark> CREATE TABLE title_and_text_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT title, text FROM wiki_small;
Finally, you can try the same commands on the full 50 GB dataset by using:
shark> CREATE EXTERNAL TABLE wiki_full (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-2010-09-12/';
To process this larger dataset quickly, you'll probably need at least 15 m1.xlarge
EC2 nodes in your cluster. (Pass -s 15 -t m1.xlarge
to spark-ec2
for example.) In our tests, a 15-node cluster launched with these settings can scan the dataset from S3 in about 80 seconds, and can easily cache the text
and title
columns in memory to speed up queries to about 2 seconds.