How to install Spark with Hive support and run TPC-DS queries:
Install Hadoop and configure HDFS
Compile Spark with Hive support
- git clone (optionally check out the desired tag)
- ./ --name spark-with-hive --tgz -Psparkr -Phadoop-2.6 -Dhadoop.version=2.6.2 -Phive -Phive-thriftserver -Pyarn (make sure the Hadoop version passed here matches the one you are running; the build takes a while)
- Install Spark in a folder of your choice (e.g. extract the generated archive into /opt/spark)
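The version check mentioned in the build step can be scripted. The following is a minimal sketch, assuming that hadoop is on the PATH and that the first line of its "hadoop version" output has the form "Hadoop X.Y.Z". BUILD_HADOOP_VERSION and parse_hadoop_version are names introduced here for illustration only.

```shell
# Hadoop version passed to the Spark build via -Dhadoop.version (from the build command).
BUILD_HADOOP_VERSION="2.6.2"

# Read `hadoop version` output on stdin and print the version found on its first line.
# Assumption: the first line looks like "Hadoop 2.6.2".
parse_hadoop_version() {
  awk 'NR == 1 { print $2 }'
}

if command -v hadoop >/dev/null 2>&1; then
  installed="$(hadoop version | parse_hadoop_version)"
  if [ "$installed" = "$BUILD_HADOOP_VERSION" ]; then
    echo "OK: Hadoop $installed matches the Spark build"
  else
    echo "WARNING: Hadoop $installed is installed, Spark was built for $BUILD_HADOOP_VERSION" >&2
  fi
else
  echo "hadoop not found on PATH, skipping the version check" >&2
fi
```

If the two versions differ, it is probably simpler to rebuild Spark with the matching -Dhadoop.version than to debug the mismatch at runtime.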
Install Hive
- Configure Hive to use MySQL as the metastore
- skip the part of the configuration about the "hive.metastore.uris" parameter
- Copy the hive-site.xml file used for the Hive configuration into the Spark configuration folder (this tells Spark where to look for the metastore)
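Copying hive-site.xml into the Spark configuration folder can be done with a small helper. This is only a sketch: sync_hive_site is a hypothetical function name, and the folders in the example call are assumptions about where Hive and Spark are installed (only /opt/spark comes from this guide).

```shell
# Copy hive-site.xml from the Hive configuration folder to the Spark one.
# Arguments: Hive configuration folder, Spark configuration folder.
sync_hive_site() {
  hive_conf_dir="$1"
  spark_conf_dir="$2"
  if [ ! -f "$hive_conf_dir/hive-site.xml" ]; then
    echo "no hive-site.xml in $hive_conf_dir" >&2
    return 1
  fi
  cp "$hive_conf_dir/hive-site.xml" "$spark_conf_dir/hive-site.xml" || return 1
  # This guide skips hive.metastore.uris, so point it out if the file sets it.
  if grep -q "hive.metastore.uris" "$spark_conf_dir/hive-site.xml"; then
    echo "note: hive.metastore.uris is set in $spark_conf_dir/hive-site.xml" >&2
  fi
  return 0
}

# Example call with assumed locations (adjust to your installation):
# sync_hive_site /opt/hive/conf /opt/spark/conf
```

Keeping a single source file and copying it avoids the two configurations drifting apart when the metastore settings change.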
Generate tpcds dataset (I'm using
- if it was not generated directly on HDFS, copy it there
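Uploading a locally generated dataset could look like the sketch below. It assumes the /data/<size> layout that the example commands in this guide suggest (/data together with size 2, read back as /data/2); LOCAL_DATA is a hypothetical local folder and dataset_dir is a helper introduced here, so adjust both to your setup.

```shell
# Build the HDFS dataset folder from a base folder and a dataset size,
# e.g. /data and 2 give /data/2 (layout assumed from the example commands).
dataset_dir() {
  printf '%s/%s\n' "${1%/}" "$2"
}

SCALE=2
LOCAL_DATA="/tmp/tpcds-generate/$SCALE"   # hypothetical local output folder of the generator
TARGET="$(dataset_dir /data "$SCALE")"

if command -v hdfs >/dev/null 2>&1; then
  # Upload only if the target folder is not on HDFS yet.
  if hdfs dfs -test -d "$TARGET"; then
    echo "$TARGET already exists on HDFS"
  else
    hdfs dfs -mkdir -p "$(dirname "$TARGET")" && hdfs dfs -put "$LOCAL_DATA" "$TARGET"
  fi
else
  echo "hdfs not found on PATH, nothing uploaded" >&2
fi
```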
Load the tables into Hive to set up the metastore using the script:
- e.g. ./ 2 /data
Build the spark application with the embedded queries
- git clone
- cd tcp-ds
- mvn clean install
Run the query by submitting the application to Spark
- spark-submit --master spark://clusterino1:7077 --class it.polimi.spark.tpcds.Query target/uber-tcp-ds-0.0.1-SNAPSHOT.jar -i /data/2 -o /output -db tpcds_text_2 -id R1
- the db is the one created in step 5, the name is "tpcds_text_"+
- custom queries can be executed using -q "query text" instead of the -id argument
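To avoid retyping the long command, the submission can be wrapped in a small function. The sketch below only prints the command (a dry run); the master URL, class, jar and arguments are the ones from the example above, while submit_cmd_by_id is a name introduced here for illustration.

```shell
# Values taken from the example command in this guide.
MASTER="spark://clusterino1:7077"
JAR="target/uber-tcp-ds-0.0.1-SNAPSHOT.jar"
MAIN_CLASS="it.polimi.spark.tpcds.Query"

# Print the spark-submit command for a predefined query id.
# Arguments: input folder, output folder, database name, query id.
submit_cmd_by_id() {
  printf 'spark-submit --master %s --class %s %s -i %s -o %s -db %s -id %s\n' \
    "$MASTER" "$MAIN_CLASS" "$JAR" "$1" "$2" "$3" "$4"
}

# Dry run: show the command; pipe the output to `sh` to actually submit it.
submit_cmd_by_id /data/2 /output tpcds_text_2 R1
```

This printed form is only suitable for the -id variant. A custom query passed with -q contains spaces and quotes, so it is safer to type that command directly than to pass it through a printed string.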
To change the dataset size:
- Repeat Step 4 (generate the dataset)
- Repeat Step 5 (load the tables, optionally dropping the other database)
- Repeat Step 7 (run the queries, as many times as needed with the required queries)
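Repeating the run step for several queries can be sketched as a loop. The values mirror the example command of this guide; QUERY_IDS, DRY_RUN and run_queries are names introduced here, and only the query id R1 is known from the text, so any other id you add is your own assumption about the application.

```shell
# Settings for one dataset size; change them together when the size changes.
INPUT_DIR="/data/2"
OUTPUT_DIR="/output"
DB="tpcds_text_2"
QUERY_IDS="R1"              # space-separated list; only R1 is named in this guide
DRY_RUN="${DRY_RUN:-1}"     # 1 prints the commands, anything else runs them

run_queries() {
  for id in $QUERY_IDS; do
    # Rebuild the argument list for each query id.
    set -- spark-submit --master spark://clusterino1:7077 \
      --class it.polimi.spark.tpcds.Query \
      target/uber-tcp-ds-0.0.1-SNAPSHOT.jar \
      -i "$INPUT_DIR" -o "$OUTPUT_DIR" -db "$DB" -id "$id"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$*"
    else
      "$@" || return 1
    fi
  done
}

run_queries
```

Running with DRY_RUN=0 submits each query in turn and stops at the first failure.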