
HW3

CS150 Database and Data Mining hw3 (adapted from UCB CS186)

Please type your Chinese name and ID.

Your Name:

Your ID:

Due: Friday May 17, 2019; 11:59 PM

Note: This homework is to be done individually!

You will implement the K-means clustering algorithm in Spark and use it to analyze political campaign data from the Federal Election Commission.

Introduction

In this assignment, you'll look at data from the Federal Election Commission (FEC), which contains details on the campaign finances of the current candidates for the presidency of the United States.

First, you'll use Spark DataFrames to run some basic analytical queries on your data. You'll see that Spark can behave like a traditional database management system: its API of query operators supports, in a scalable fashion, the same kinds of queries handled by other SQL systems.

While Spark clearly supports basic queries, the true power of its programming model lies in its flexibility for more complicated, iterative tasks. To see this, you'll also use Spark operators to implement K-means clustering, a widely used algorithm in machine learning for categorizing and classifying data.

Finally, you'll combine algorithm and data to perform some scalable machine learning. Hopefully, you'll be able to see some trends in campaign contributions.

Context

You'll need to piece together the data, system, and the algorithm to make this all work.

The Data: Federal Election Commission Finance Data

You will be performing data analytics on about 432 MB of FEC data. This data describes the finances of candidates running for election in 2016, detailing the contributions from individuals and organizations to campaigns that are disclosed to the FEC.

You can also download a copy of the data we used [here](http://eecs.berkeley.edu/~jegonzal/fec_2016_3_25.tar.bz2). Each table consists of a header .csv file and a data .txt file. For example, this is the Committee Master header file (cm_header_file.csv):

CMTE_ID,CMTE_NM,TRES_NM,CMTE_ST1,CMTE_ST2,CMTE_CITY,CMTE_ST,CMTE_ZIP,CMTE_DSGN,CMTE_TP,CMTE_PTY_AFFILIATION,CMTE_FILING_FREQ,ORG_TP,CONNECTED_ORG_NM,CAND_ID

and these are the first four rows of the corresponding data file (cm.txt):

C00000059|HALLMARK CARDS PAC|BROWER, ERIN MS.|2501 MCGEE|MD#288|KANSAS CITY|MO|64108|U|Q|UNK|M|C|HALLMARK CARDS, INC.|
C00000422|AMERICAN MEDICAL ASSOCIATION POLITICAL ACTION COMMITTEE|WALKER, KEVIN|25 MASSACHUSETTS AVE, NW|SUITE 600|WASHINGTON|DC|20001|B|Q||M|M|AMERICAN MEDICAL ASSOCIATION|
C00000489|D R I V E POLITICAL FUND CHAPTER 886|TOM RITTER|3528 W RENO||OKLAHOMA CITY|OK|73107|U|N||Q|L|TEAMSTERS LOCAL UNION 886|
C00000547|KANSAS MEDICAL SOCIETY POLITICAL ACTION COMMITTEE|C. RICHARD BONEBRAKE, M.D.|623 SW 10TH AVE||TOPEKA|KS|66612|U|Q|UNK|Q|T||

Here's a quick summary of the tables:

  • cm contains committee information
  • cn contains candidate information
  • ccl contains linkage between committees and candidates
  • itcont contains individual contributions to committees
  • itpas2 contains contributions between committees
  • itoth contains other transactions between committees

Read the specification at the FEC site to further familiarize yourself with the details of these formats.

The System: Spark SQL (DataFrames)

You'll be using Spark SQL to process the FEC data. Essentially, Spark SQL is an extension of Spark that adds database abstractions, allowing applications to manipulate data both through SQL and through Spark operators.

Spark DataFrames are a specialization of the RDD that additionally organizes RDD records into columns. In fact, you can always access the underlying RDD for a given DataFrame using dataframe.rdd. Each DataFrame represents a SQL table and exposes logical SQL operators as methods, which compose with other Spark and user-defined functions.

For instance, suppose we have a SQL table containing two-dimensional points,

CREATE TABLE points (x int, y int)

and we want to sample 20% of the x-values in the fourth quadrant into a table samples. In SQL we could write:

CREATE TABLE samples AS
SELECT x
FROM points
WHERE x > 0
 AND y < 0
 AND rand() < 0.2;

Alternatively, we can read this table into a Spark DataFrame, points

points = some_loading_function(points_data)
points.registerTempTable("points")

We can then compute, save, and print this sample of points:

samples = points.where(points.x > 0)\
    .where(points.y < 0)\
    .sample(False, 0.2)\
    .select('x')
samples.registerTempTable("samples")

samples.show()

In fact, due to this duality, DataFrames support direct SQL queries as well. To run the above query on a DataFrame, we write

sql.sql("SELECT x FROM points WHERE x > 0 AND y < 0 AND rand() < 0.2")\
    .registerTempTable("samples")

Play around with DataFrames to get a sense of how they work. You can read the official documentation for more details.

The Algorithm: K-means Clustering

Clustering is a common task in machine learning in which similar objects, represented as vectors, are grouped into broad categories. The K-means clustering algorithm is one method for performing this task: each object is placed into one of k groups according to which of the k designated cluster centers it is closest to.

Here is a sketch of the algorithm:

def k_means(data, k):
    # initialization
    new_centers = initialize_centers_from_data(data)

    # main iteration
    while not_converged(...):
        old_centers = new_centers

        # initialize statistics for the iteration
        cluster_sizes = new_vector(k)
        sums = new_matrix(k, d)  # d = dimensionality of the data vectors

        # compute statistics for new centers
        for vector in data:
            center = nearest_center(old_centers, vector)
            cluster_sizes[center] += 1.0
            sums[center] += vector

        # compute new centers from statistics
        new_centers = [s / n for (n, s) in zip(cluster_sizes, sums)]

    return new_centers

Notice that the algorithm's main iterative loop involves the slow improvement of the currently-found best centers. The algorithm must thus initialize these centers somehow before the first iteration runs. The traditional K-means algorithm initializes its centers uniformly at random, but we can improve this first guess by picking the first centers cleverly, using the K-means++ optimization. In K-means++, we pick centers "far" away from previous centers to improve the accuracy and speed of convergence of our algorithm.
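
To make the idea concrete, here is a non-distributed sketch of K-means++ initialization on an in-memory NumPy array. It is purely illustrative (the assignment asks you to do this scalably over DataFrames), and the function name kmeans_plus_plus_init is not part of the notebook.

import numpy as np

def kmeans_plus_plus_init(data, k, rng=np.random):
    """Pick k initial centers from an in-memory (n, d) array, favoring
    points far from the centers chosen so far (squared-distance weighting)."""
    centers = [data[rng.randint(len(data))]]   # first center: uniform at random
    for _ in range(k - 1):
        # squared distance from each point to its nearest existing center
        dists = np.min([np.linalg.norm(data - c, axis=1) for c in centers],
                       axis=0) ** 2
        # sample the next center with probability proportional to that distance
        probs = dists / dists.sum()
        centers.append(data[rng.choice(len(data), p=probs)])
    return np.array(centers)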

A Useful Library: NumPy

NumPy is a Python library used for scientific computation. It implements many useful, common functions used in numerical applications:

# a vector of length k
arr = np.zeros(k)

# a matrix of size k * d
mat = np.zeros((k, d))

# python list as NumPy vector
numbers_arr = np.asarray(python_list)

# minimum of a list
smallest = np.min(numbers_arr)

# suppose we have two lists of points, point_arr1 and point_arr2
# then get the distances (2-norm) between each pair of points
distances = np.linalg.norm(point_arr1 - point_arr2, axis=1)

# alternatively, if we want to get the distances
# between all the points in an array point_arr
# and a single point point0
distances = np.linalg.norm(point_arr - point0, axis=1)

It may help to use the NumPy documentation as a reference.

Specification

Time to put your database skills to use!

Note: We will not grade code that is modified outside of the cells bracketed by Begin Student Code Here and End Student Code Here markers. Make sure all the changes you want to submit lie between those markers. DO NOT REMOVE THE Begin Student Code Here AND End Student Code Here CELLS.

1. DataFrames

Your first task is to carry out basic data analytics queries against the FEC data.

1a. File Format Wrangling

We must convert the raw FEC files into Spark's DataFrame format before we can manipulate them with Spark methods. Implement load_dataframe, which loads a pair of FEC files into a DataFrame.

The paths specified by the file_struct objects are rooted at the location specified by root_path. We've provided a load_header function which reads in the .csv header file of a table and returns a list of column names. Use these to properly specify the schema for the DataFrame table.
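
For orientation, here is a minimal sketch of one way load_dataframe might look. It assumes a Spark 2.x-style csv reader, that sql is the notebook's SQLContext, and that file_struct exposes header_path and data_path attributes; those attribute names are hypothetical, so match them to whatever the notebook actually provides.

import os

def load_dataframe(sql, root_path, file_struct):
    # Column names come from the provided load_header helper.
    columns = load_header(os.path.join(root_path, file_struct.header_path))
    # The FEC data files are pipe-delimited and carry no header row of their own.
    df = sql.read.csv(os.path.join(root_path, file_struct.data_path), sep="|")
    # Rename the default _c0, _c1, ... columns to the real schema.
    return df.toDF(*columns)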

1b. Basic Analytics Queries

Now that our files have been loaded into Spark, we can use DataFrames to perform some basic analytical queries, similar to what we can do with SQL. Observe that we can either write raw SQL queries or use DataFrame methods to apply SQL operators directly. For this question, you may use either approach.
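
As a quick reminder of the two styles, here is an illustrative per-committee count written both ways. Here itcont_df is just a placeholder name for the individual-contributions DataFrame from part 1a, assumed to also be registered as the temp table itcont.

# DataFrame-method style: count rows per committee
counts_df = itcont_df.groupBy("CMTE_ID").count()

# Equivalent raw SQL style
counts_sql = sql.sql("""
    SELECT CMTE_ID, COUNT(*) AS cnt
    FROM itcont
    GROUP BY CMTE_ID
""")

counts_df.show(5)
counts_sql.show(5)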

Answer the following questions by writing queries:

  1. What are the ID numbers and Principal Candidate Committee numbers of the 4 current presidential candidate front-runners (Hillary Clinton, Bernie Sanders, Donald Trump, and Ted Cruz)? Hint: Take a look at the output of the demonstration. What values do we want these columns to have?
  2. How many contributions by individuals has each front-runner's principal campaign committee received? Hint: Which table might you want to join on? Do not filter by ENTITY_TP.
  3. How much in total has each front-runner's principal campaign committee received from the contributions in Q2?
  4. What are the committees that are linked to each of the front-runners? Hint: How many tables will we need for this?
  5. How many contributions by committees has each front-runner received? Hint: Do not filter by ENTITY_TP.
  6. How much in total has each front-runner received from the contributions in Q5?

Note: The penultimate line of each cell describes the schema for the output we're expecting. If you change this line, make sure that your output's schema matches these column names.

2. K-means Clustering

Time to do more advanced analysis. Implement the K-means algorithm on DataFrames.

We'll start with some toy data before returning to the more complex campaign data. The toy data is just a collection of 2D points; you can play around with it to experiment. (It is stored as a Parquet file for efficiency; you don't need to worry about the specifics, except that Parquet files appear as directories in the file system and can be loaded with sql.read.parquet.)
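
For instance, loading and peeking at the toy points might look like this (the path below is a placeholder; use the one given in the notebook):

toy_points = sql.read.parquet("path/to/toy_points")
toy_points.show(5)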

Hint: NumPy will come in handy here for computing things like distances, or for generating pseudo-random numbers.

Note: Remember, we are using DataFrames to ensure that our algorithm scales with the size of the data input. Do not attempt to read large amounts of data into memory (e.g. into Python lists)! Doing so may cause your code to run very slowly, and you may lose points as a result.
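
As a hedged illustration of the difference, the snippet below contrasts collecting everything to the driver with letting Spark aggregate in parallel; points is assumed to be a DataFrame of 2D rows with columns x and y, as in the earlier example.

import numpy as np

# Avoid: this pulls every row onto the driver as a Python list.
# all_points = points.collect()

# Prefer: aggregate in parallel and bring back only a handful of numbers.
count, total = points.rdd \
    .map(lambda row: np.array([row.x, row.y])) \
    .aggregate((0, np.zeros(2)),
               lambda acc, v: (acc[0] + 1, acc[1] + v),
               lambda a, b: (a[0] + b[0], a[1] + b[1]))
mean = total / count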

2a. K-means++ Initialization

First, you will initialize the centers using the K-means++ algorithm. Since we want to sample for centers in a scalable fashion, we will use Distributed Reservoir Sampling to randomly select a center.

Implement initialize_centers_plus. We've provided the signatures of some helper functions that will most likely be useful to you: choose_partition_center, pick_between_centers, and nearest_center may all play a part in your solution to initialize_centers_plus. When you've finished implementing this function, you can run it against our set of toy data; in particular, the initial centers selected by K-means++ should be far away from each other.
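
As one example of how NumPy fits in here, this is a possible shape for the nearest_center helper; the actual signature in the notebook may differ, so treat it as a sketch.

import numpy as np

def nearest_center(centers, vector):
    """Return the index of the center closest to `vector`, plus that distance."""
    centers = np.asarray(centers)
    distances = np.linalg.norm(centers - np.asarray(vector), axis=1)
    idx = int(np.argmin(distances))
    return idx, distances[idx]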

2b. Main Loop

Now, using our initialization code from before, we will implement the main loop of K-means clustering (and the rest of the function, too).

Provide an implementation of k_means. We've given you has_converged, a useful function for determining when the main loop of the K-means algorithm has converged. In addition, you may find that filling out compute_new_center_statistics and add_statistics will help you complete the task.
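
To illustrate what those statistics might look like, here is a sketch that assumes each statistics value is a (cluster_sizes, sums) pair, with cluster_sizes of shape (k,) and sums of shape (k, d). The notebook's actual representation and signatures may differ, and statistics_to_centers is a hypothetical helper name.

import numpy as np

def add_statistics(stats_a, stats_b):
    """Merge the (counts, sums) pairs produced by two partitions."""
    counts_a, sums_a = stats_a
    counts_b, sums_b = stats_b
    return counts_a + counts_b, sums_a + sums_b

def statistics_to_centers(stats, old_centers):
    """Turn accumulated statistics into new centers, keeping the old center
    for any cluster that received no points this iteration."""
    counts, sums = stats
    new_centers = np.array(old_centers, dtype=float)
    nonempty = counts > 0
    new_centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return new_centers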

You can test your implementation of the completed K-means algorithm against the toy data. On convergence, the algorithm should produce k clusters that accurately partition the data.

3. Geographical Contribution Clustering

Finally, we can try out our K-means algorithm on our original set of campaign finance data: we'll attempt to categorize campaign contributions by geographic location using clusters.

You've just implemented the K-means algorithm, and we've already implemented code to load the zip code data into DataFrames, so all you have to do is find the right value of k to use. Essentially, we want a value of k which minimizes the error from the resulting clusters; at the same time, a larger value of k carries with it the risk of overfitting. We define the error of the procedure to be the mean of distances from points to their cluster center.

We can get a rough idea of a good value by trying many values of k and plotting the resulting error; the "elbow method" is a good heuristic for choosing an optimal value. Using this heuristic, what is a reasonable value of k? Try values of k from 2 to 30 in increments of 4. To speed up these experiments, set the convergence epsilon threshold for k_means to 0.001.
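
A sweep over k might look like the sketch below. Here zip_data stands in for the DataFrame prepared by the provided loading code, compute_error is a hypothetical helper for the mean point-to-center distance, and the epsilon keyword is assumed to match however your k_means exposes its convergence threshold.

import matplotlib.pyplot as plt

ks = list(range(2, 31, 4))            # 2, 6, 10, ..., 30
errors = []
for k in ks:
    centers = k_means(zip_data, k, epsilon=0.001)   # signature may differ
    errors.append(compute_error(zip_data, centers))

plt.plot(ks, errors, marker="o")
plt.xlabel("k")
plt.ylabel("mean distance to cluster center")
plt.show()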

After determining this value, you can finally visualize campaign contributions to different candidates as geographical clusters. Congratulations on finishing the last project in this course!

Note: Part 3 may take a while, but it should not take more than 20 minutes.

Testing

We've provided some toy data in the notebook to test your implementation of the K-means algorithm. Of course, you are advised to write your own tests to catch bugs in your implementation.
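
One way to add a test of your own, assuming your k_means accepts a DataFrame of 2D points like the toy data, is to generate well-separated blobs with known centers and check that the recovered centers land near them; all names below are illustrative.

import numpy as np

rng = np.random.RandomState(0)
true_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.randn(100, 2) for c in true_centers])

blob_df = sql.createDataFrame([(float(x), float(y)) for x, y in points],
                              ["x", "y"])
found_centers = k_means(blob_df, 3)
print(found_centers)   # each should be close to one of the true centers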

Submission

Please submit the most recent version of your homework code, including all .py files (but not the data), to this repository.
