SampleClean Installation Instructions

Sanjay Krishnan edited this page May 27, 2014 · 8 revisions

Requirements:

  • Scala 2.10.4
  • Spark 0.9.1

When starting out, we suggest making a separate directory for SampleClean.

$ mkdir sampleclean_dev
$ cd sampleclean_dev

We will now set up the two requirements: (1) Scala and (2) Spark. SampleClean requires Scala version 2.10.4, so first download and untar Scala 2.10.4:

$ wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
$ tar xvzf scala-2.10.4.tgz

Then, download and untar Spark 0.9.1:

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1.tgz
$ tar xvzf spark-0.9.1.tgz

Clone the SampleClean repository:

$ git clone https://github.com/sjyk/sampleclean.git

Once the repository is cloned, run the following to initialize the SampleClean patches for BlinkDB and Hive_BlinkDB:

$ cd sampleclean
$ git submodule init
$ git submodule update
$ cd hive_blinkdb
$ git pull https://github.com/sjyk/hive_blinkdb.git

To build the patched Hive version, run:

$ ant package

ant package builds all Hive jars and puts them into the build/dist directory. Building requires ant >= 1.8.2. If your distribution's package manager cannot provide a recent enough version (for example, on older versions of CentOS such as 6.4, yum can't install ant 1.8 out of the box), download the ant binaries from http://ant.apache.org/bindownload.cgi and install from those instead.
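Before running the build, it can help to confirm that the ant on your PATH is recent enough. A minimal sketch, not part of SampleClean: the version_ge helper is hypothetical, and it assumes `ant -version` prints the version number as the fourth field of its first line and that your `sort` supports `-V`:

```shell
# Hypothetical helper: version_ge A B succeeds if version A >= version B.
# Relies on `sort -V` (GNU coreutils / BusyBox); adjust if unavailable.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# `ant -version` prints e.g. "Apache Ant(TM) version 1.9.4 compiled on ...";
# the version number is assumed to be the fourth field.
if command -v ant >/dev/null 2>&1; then
  installed="$(ant -version | awk 'NR==1 {print $4}')"
else
  installed=""
fi

if version_ge "${installed:-0}" "1.8.2"; then
  echo "ant ${installed} is new enough"
else
  echo "upgrade ant: found '${installed:-none}', need >= 1.8.2"
fi
```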

The BlinkDB/SampleClean code is in the sampleclean/ directory. To set up your environment to run BlinkDB/SampleClean locally, you need to set the SCALA_HOME, SPARK_HOME, JAVA_HOME, and HIVE_HOME environment variables in the file sampleclean/conf/blinkdb-env.sh to point to the folders you just downloaded. BlinkDB comes with a template file, blinkdb-env.sh.template, that you can copy and modify to get started:

$ cd sampleclean/conf
$ cp blinkdb-env.sh.template blinkdb-env.sh

Edit sampleclean/conf/blinkdb-env.sh and uncomment and set the following variables for local mode. First, set the Scala directory:

#Set the following to the top-level directory containing the scala code
export SCALA_HOME="/path/to/scala-2.10.4"

Then set the spark directory:

#Set the following to the top-level directory containing the spark code
export SPARK_HOME="/path/to/spark-0.9.1"

Set your Java home variable (if it isn't set already):

export JAVA_HOME="/path/to/java-home-1.7_21-or-newer"

Finally, you need to set the location of your Hive binaries (built with ant); these will be in the build/dist folder of hive_blinkdb:

export HIVE_HOME="/path/to/hive_blinkdb/build/dist"

To use these variables in your current environment, run

$ source blinkdb-env.sh
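Before moving on, a quick sanity check that each variable points at a real directory can save debugging time later. This is just a sketch, not part of SampleClean; the check_home helper is hypothetical:

```shell
# Hypothetical sanity check: confirm each variable is set and points
# at an existing directory before going any further.
check_home() {
  # $1 = variable name, $2 = its value
  if [ -n "$2" ] && [ -d "$2" ]; then
    echo "$1 OK: $2"
  else
    echo "$1 is not a directory: '$2'"
  fi
}

check_home SCALA_HOME "$SCALA_HOME"
check_home SPARK_HOME "$SPARK_HOME"
check_home HIVE_HOME  "$HIVE_HOME"
```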

Next, package and publish Spark and BlinkDB/SampleClean:

$ cd $SPARK_HOME
$ sbt/sbt publish-local
$ cd <blinkdb directory>
$ sbt/sbt package

Next, create the default Hive warehouse directory. This is where Hive will store table data for native tables.

$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod 0777 /user/hive/warehouse  # Or make your username the owner
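If the CLI later fails to create or load tables, the warehouse directory's permissions are a common culprit, so it's worth verifying them now. A minimal sketch; the warehouse_ready helper is hypothetical, not a SampleClean command:

```shell
# Hypothetical check: verify the warehouse directory exists and is
# writable by the current user before starting the CLI.
warehouse_ready() {
  [ -d "$1" ] && [ -w "$1" ]
}

if warehouse_ready /user/hive/warehouse; then
  echo "warehouse directory is ready"
else
  echo "warehouse directory is missing or not writable"
fi
```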

You can now start the BlinkDB/SampleClean CLI:

$ ./bin/sampleclean

Once started, BlinkDB/SampleClean will present a SampleClean prompt:

sampleclean>

We have provided an example dirty dataset of world cities with various text formatting issues and semantic errors. Create a table and load the data into the table:

sampleclean> CREATE TABLE cities (city string, country string, population string, area string, density string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

sampleclean> LOAD DATA LOCAL INPATH 'data/files/world_population.csv' OVERWRITE INTO TABLE cities;

To initialize a 10% "dirty sample" run:

sampleclean> SCINITIALIZE cities_sample (city, country, population, area, density) 
FROM cities SAMPLEWITH 0.1;

Then, set the following variables:

sampleclean> set sampleclean.sample.size=<samplesize>;
sampleclean> set sampleclean.dataset.size=125;
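As a rough sanity check on these settings: with the 125-row cities table and a 10% sample, the sample should hold about 12-13 rows (the exact count varies because sampling is randomized). The arithmetic can be sketched as:

```shell
# Expected sample size = sampling fraction * dataset size.
# 0.1 * 125 ≈ 12 rows (integer arithmetic truncates here; the actual
# count for a given sample is reported by sccount).
dataset_size=125
fraction_pct=10
expected=$(( dataset_size * fraction_pct / 100 ))
echo "expected sample size: about $expected rows"
```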

You can get the sample size with:

sampleclean> sccount cities_sample;

To see what the data looks like run:

sampleclean> scshow cities_sample;

To run a first query try:

sampleclean> selectrawsc sum(population) from cities_sample;

This query will return NaN, as there are string formatting issues in the population field.

You can fix these problems (for any attribute) with:

sampleclean> scformat cities_sample population number; 

To review your changes you can run:

sampleclean> scshow cities_sample;
sampleclean> selectrawsc sum(population) from cities_sample;

You will now get results with confidence intervals. There are many other data cleaning primitives to try out; for example, this one removes all of the "cities" with "/" in their names:

sampleclean> scfilter cities_sample city not like '%/%'; 

After playing around with the cleaning, if you want to reset to the beginning, run:

sampleclean> screset cities_sample;