SampleClean Installation Instructions
Requirements:
- Scala 2.10.4
- Spark 0.9.1
When starting out, we suggest making a separate directory for SampleClean.
$ mkdir sampleclean_dev
$ cd sampleclean_dev
We will now set up the two requirements: (1) Scala and (2) Spark. SampleClean requires Scala version 2.10.4, so first download and untar Scala 2.10.4:
$ wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
$ tar xvzf scala-2.10.4.tgz
Then, download and untar Spark 0.9.1
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1.tgz
$ tar xvzf spark-0.9.1.tgz
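Before moving on, it can help to confirm that both toolchains unpacked where the later steps expect them; a minimal sanity check (the directory names match the tarballs above):

```shell
# Confirm the Scala and Spark trees exist before wiring up environment variables.
for d in scala-2.10.4 spark-0.9.1; do
  if [ -d "$d" ]; then
    echo "$d: ok"
  else
    echo "$d: missing -- re-run the wget/tar steps above"
  fi
done
```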
Clone the SampleClean repository:
$ git clone https://github.com/sjyk/sampleclean.git
Once the repository is cloned, run the following to initialize the SampleClean patches for BlinkDB and Hive_BlinkDB:
$ cd sampleclean
$ git submodule init
$ git submodule update
$ cd hive_blinkdb
$ git pull https://github.com/sjyk/hive_blinkdb.git
To build the patched Hive version, run:
$ ant package
ant package builds all Hive jars and puts them into the build/dist directory. If the build fails on your machine, you probably need to upgrade to a newer version of ant; ant >= 1.8.2 should work. Download ant binaries at http://ant.apache.org/bindownload.cgi. You may also be able to upgrade ant with your distribution's package manager, but on older versions of CentOS (e.g. 6.4) yum can't install ant 1.8 out of the box, so installing ant from the downloaded binary package is recommended.
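Since the build needs ant >= 1.8.2, it is worth checking the installed version before running the build. A sketch of such a check (assumptions: ant prints its usual "Apache Ant ... version X.Y.Z ..." banner, and sort supports the GNU -V version-sort flag):

```shell
#!/bin/sh
# Compare the installed ant version against the 1.8.2 minimum using sort -V.
required="1.8.2"
version=$(ant -version 2>/dev/null | sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p')
if [ -z "$version" ]; then
  echo "ant not found; install it from http://ant.apache.org/bindownload.cgi"
elif [ "$(printf '%s\n%s\n' "$required" "$version" | sort -V | head -n 1)" = "$required" ]; then
  echo "ant $version is new enough"
else
  echo "ant $version is too old; upgrade to >= $required"
fi
```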
The BlinkDB/SampleClean code is in the sampleclean/ directory. To set up your environment to run BlinkDB/SampleClean locally, you need to set the HIVE_HOME and SCALA_HOME environment variables in the file sampleclean/conf/blinkdb-env.sh to point to the folders you just downloaded. BlinkDB comes with a template file blinkdb-env.sh.template that you can copy and modify to get started:
$ cd sampleclean/conf
$ cp blinkdb-env.sh.template blinkdb-env.sh
Edit sampleclean/conf/blinkdb-env.sh and uncomment and set the following variables for local mode. First, set the Scala directory:
#Set the following to the top-level directory containing the scala code
export SCALA_HOME="/path/to/scala-2.10.4"
Then set the spark directory:
#Set the following to the top-level directory containing the spark code
export SPARK_HOME="/path/to/spark-0.9.1"
Set your Java home variable (if it isn't set already):
export JAVA_HOME="/path/to/java-home-1.7_21-or-newer"
Finally, you need to set the location of your Hive binaries (built with ant); this will be in the build/dist folder of hive_blinkdb:
export HIVE_HOME="/path/to/hive_blinkdb/build/dist"
To use these variables in your current environment, run
$ source blinkdb-env.sh
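For reference, here is what a complete blinkdb-env.sh for local mode might look like once the four variables above are filled in. Every path below is illustrative and must be replaced with your actual install locations:

```shell
# blinkdb-env.sh -- local-mode settings (example paths only)

# Top-level directory containing the Scala 2.10.4 distribution
export SCALA_HOME="$HOME/sampleclean_dev/scala-2.10.4"

# Top-level directory containing the Spark 0.9.1 distribution
export SPARK_HOME="$HOME/sampleclean_dev/spark-0.9.1"

# Java 1.7 or newer
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk"

# Hive binaries built with `ant package` above
export HIVE_HOME="$HOME/sampleclean_dev/sampleclean/hive_blinkdb/build/dist"
```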
Next, package and publish Spark and BlinkDB/SampleClean
$ cd $SPARK_HOME
$ sbt/sbt publish-local
$ cd <blinkdb directory>
$ sbt/sbt package
Next, create the default Hive warehouse directory. This is where Hive will store table data for native tables.
$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod 0777 /user/hive/warehouse # Or make your username the owner
You can now start the BlinkDB/SampleClean CLI:
$ ./bin/sampleclean
After it starts, BlinkDB/SampleClean will present a SampleClean prompt:
sampleclean>
We have provided an example dirty dataset of world cities with various text formatting issues and semantic errors. Create a table and load the data into the table:
sampleclean> CREATE TABLE cities (city string, country string, population string, area string, density string) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
sampleclean> LOAD DATA LOCAL INPATH 'data/files/world_population.csv' OVERWRITE INTO TABLE cities;
To initialize a 10% "dirty sample" run:
sampleclean> SCINITIALIZE cities_sample (city, country, population, area, density)
FROM cities SAMPLEWITH 0.1;
Then, set the sample and dataset size variables. You can get the sample size with:
sampleclean> sccount cities_sample;
Using that count, set:
sampleclean> set sampleclean.sample.size=<samplesize>;
sampleclean> set sampleclean.dataset.size=125;
To see what the data looks like run:
sampleclean> scshow cities_sample;
To run a first query try:
sampleclean> selectrawsc sum(population) from cities_sample;
This query will return NaN because there are string formatting issues in the population field.
You can fix these problems (for any attribute) with:
sampleclean> scformat cities_sample population number;
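For intuition about what a number coercion has to do here: values stored as text, e.g. with thousands separators, will not sum as numerics until the formatting characters are stripped. A rough shell illustration of that idea (the actual cleaning logic lives inside SampleClean's scformat, not in this snippet):

```shell
# A comma-formatted population value cannot be summed as a number as-is;
# stripping the separators yields a parseable numeric string.
dirty="1,234,567"
clean=$(printf '%s' "$dirty" | tr -d ',')
echo "$clean"   # prints 1234567
```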
To review your changes you can run:
sampleclean> scshow cities_sample;
sampleclean> selectrawsc sum(population) from cities_sample;
You will now get results with confidence intervals. There are many other data cleaning primitives to try out; for example, this removes all of the "cities" with "/" in their names:
sampleclean> scfilter cities_sample city not like '%/%';
After playing around with the cleaning, if you want to reset to the beginning, run:
sampleclean> screset cities_sample;