Creates a Hive database based on the Airline On-Time dataset that can be explored using a Tableau dashboard.
The scripts should be run from a machine:
- with access to the Hadoop cluster and
- having the HDFS and Hive (beeline) clients installed and configured
By default the script will:
- Download the entire dataset (from 1998 to 2008). To limit the scope, adjust the START and END parameters.
- Use localhost to connect to the Hive server. To connect elsewhere, set the HIVE_HOST parameter to the location of HiveServer2.
- Use Tez as the default execution engine. To use LLAP (recommended), set the LLAP parameter to true.
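As a rough illustration of how these four parameters could drive a beeline connection, here is a minimal sketch. The variable names START, END, HIVE_HOST and LLAP come from the text above; HIVE_PORT, the `default` database, and the way the settings are assembled are assumptions for illustration, not the scripts' actual contents. The snippet only prints the command it would run.

```shell
# Defaults mirroring the behaviour described above; override via environment.
START="${START:-1998}"
END="${END:-2008}"
HIVE_HOST="${HIVE_HOST:-localhost}"
LLAP="${LLAP:-false}"
HIVE_PORT="${HIVE_PORT:-10000}"   # assumption: HiveServer2 default binary port

# JDBC URL that beeline would connect to (database name is an assumption).
JDBC_URL="jdbc:hive2://${HIVE_HOST}:${HIVE_PORT}/default"

# Tez is always the engine; LLAP additionally switches the LLAP execution mode.
if [ "$LLAP" = "true" ]; then
  ENGINE_SETTINGS="set hive.execution.engine=tez; set hive.llap.execution.mode=all;"
else
  ENGINE_SETTINGS="set hive.execution.engine=tez;"
fi

echo "Years: ${START}..${END}"
echo "beeline -u \"${JDBC_URL}\" -e \"${ENGINE_SETTINGS}\""
```

For example, `START=2005 END=2008 LLAP=true` would restrict the range to four years and enable LLAP under these assumptions.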
Start by executing the first script. It will download the data from the site and create a staging table on top of it.
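The download step can be pictured as a loop over the selected years followed by an upload to HDFS. The sketch below is a dry run that only prints the commands. BASE_URL, the per-year file naming, and HDFS_DIR are hypothetical placeholders, since the real download location and layout are defined in the script itself.

```shell
START="${START:-1998}"
END="${END:-2008}"
BASE_URL="https://example.invalid/airline"   # placeholder, not the real site
HDFS_DIR="/tmp/airline/staging"              # hypothetical HDFS location

# Print one download command per year in the inclusive range [first, last].
plan_downloads() {
  first="$1"
  last="$2"
  year="$first"
  while [ "$year" -le "$last" ]; do
    echo "curl -fsSL -o ${year}.csv ${BASE_URL}/${year}.csv"
    year=$((year + 1))
  done
}

# Print the commands that would copy the downloaded files into HDFS.
plan_upload() {
  echo "hdfs dfs -mkdir -p ${HDFS_DIR}"
  echo "hdfs dfs -put -f ./*.csv ${HDFS_DIR}/"
}

plan_downloads "$START" "$END"
plan_upload
```

A staging table would then typically be an external Hive table whose location points at the HDFS directory, so no data is copied a second time.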
Then run the second script. It will create an optimized, partitioned Hive table stored as ORC.
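To show what "partitioned and stored as ORC" can look like, the sketch below generates a small HiveQL file from a shell function. The table names (flights_staging, flights_orc), the columns, and the choice of the year as partition key are all hypothetical; the actual schema is defined by the script.

```shell
# Emit illustrative HiveQL: create an ORC table partitioned by year and
# fill it from the staging table using dynamic partitioning.
orc_hql() {
  cat <<'EOF'
CREATE TABLE IF NOT EXISTS flights_orc (
  flight_date STRING,
  carrier     STRING,
  origin      STRING,
  dest        STRING,
  dep_delay   INT,
  arr_delay   INT
)
PARTITIONED BY (flight_year INT)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE flights_orc PARTITION (flight_year)
SELECT flight_date, carrier, origin, dest, dep_delay, arr_delay, flight_year
FROM flights_staging;
EOF
}

orc_hql
```

Saved to a file, such a statement list could be run with `beeline -u <jdbc-url> -f <file>`. Partitioning by year lets Hive skip whole directories when a query filters on the year.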
A third script is available to create a denormalized table backed by Druid. NOTE: Druid must be properly installed and configured (including Hive integration) before running this script.
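For orientation, a Druid-backed table in Hive is usually created with the Druid storage handler and a timestamp column named `__time`. The sketch below prints one such statement. The table and column names and the granularity settings are hypothetical and assume the Hive/Druid integration is already configured, so treat it as an outline rather than the script's actual DDL.

```shell
# Emit illustrative HiveQL for a Druid-backed table built from an existing
# Hive table. All table and column names are placeholders.
druid_hql() {
  cat <<'EOF'
CREATE TABLE flights_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.segment.granularity" = "MONTH",
  "druid.query.granularity"   = "DAY"
)
AS
SELECT
  CAST(flight_date AS TIMESTAMP) AS `__time`,  -- Druid requires a __time column
  carrier,
  origin,
  dest,
  dep_delay,
  arr_delay
FROM flights_orc;
EOF
}

druid_hql
```

String columns become Druid dimensions and numeric columns become metrics, which is why a denormalized (pre-joined) source table is the natural input.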