Capstone Project 1: Data Wrangling

Data Collection

To attempt to predict hyperlocal chronic disease prevalence, I needed census tract-level data for target and feature variables. Data for 13 target variables (chronic health outcomes) and 14 features (prevention measures and unhealthy behaviors) on 28,004 census tracts was collected from the CDC’s 500 Cities: Local Data for Better Health. This data came in the form of a fairly clean CSV file. Only four prevention measures and three health outcomes were missing any data, and the amount of missing data was less than 1% and 2%, respectively. However, 800 rows were missing all data, and every variable contained outliers.
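A minimal sketch of the initial load and missingness check, assuming a pandas workflow and a hypothetical file name for the 500 Cities export:

```python
import pandas as pd

# Hypothetical file name; the actual export from the 500 Cities portal may differ.
cities = pd.read_csv("500_cities_tract_level.csv")

# Share of missing values per column, to confirm the <1-2% missingness noted above.
missing_share = cities.isna().mean().sort_values(ascending=False)
print(missing_share.head(10))
```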

An additional 420 features (out of 16,557 available variables) were chosen from the Census Bureau’s American Community Survey 5-Year Subject Data (ACS) and collected using their API. These features were also calculated at the census-tract level and included an array of demographic data in areas including age, sex, race, education, employment, and economics. This portion of data collection took a considerable amount of time, as there were thousands of variables to sort through and the API required the variables to be called one state at a time. Many of the variables in the ACS data were calculated as both population counts and percentages. I ultimately decided to go with counts, since percentages could be calculated from the other columns (including total population) if needed. The ACS data proved to be considerably messier, with lots of missing and nonsensical values (such as negative population counts). Much like the 500 Cities data, most of the ACS variables also contained outliers.
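A sketch of the state-by-state API pull, assuming the ACS 5-Year Subject endpoint; the vintage, variable IDs, and key shown here are placeholders, not the exact set used in the project:

```python
import requests
import pandas as pd

BASE_URL = "https://api.census.gov/data/2017/acs/acs5/subject"  # assumed vintage/endpoint
VARIABLES = ["S0101_C01_001E"]  # placeholder variable IDs; the project pulled ~420 of these
API_KEY = "YOUR_CENSUS_API_KEY"

def get_state_tracts(state_fips):
    """Pull the chosen ACS variables for every census tract in one state."""
    params = {
        "get": ",".join(["NAME"] + VARIABLES),
        "for": "tract:*",
        "in": f"state:{state_fips}",
        "key": API_KEY,
    }
    resp = requests.get(BASE_URL, params=params)
    resp.raise_for_status()
    rows = resp.json()  # first row is the header
    return pd.DataFrame(rows[1:], columns=rows[0])

# The API has to be called one state at a time, so loop over state FIPS codes and concatenate.
state_fips = ["01", "02", "04"]  # in practice, all state FIPS codes
acs = pd.concat([get_state_tracts(f) for f in state_fips], ignore_index=True)
```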

Data Cleaning

Before exploring the 500 Cities dataset, I needed to reshape it from a long (tidy) format to a wide format. Although this breaks from the principles of tidy data, it allows the model to treat each census tract as a single observation. To reshape the dataframe, I pivoted the variables to columns and then merged the population and geolocation data back in as columns. I dropped the 800 rows that were completely empty, bringing the total number of observations down to 27,204. Because the outliers appeared to be genuine data (chronic disease tends to cluster within census tracts) and could skew the mean, I imputed the remaining missing values with the median.
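A sketch of this reshaping step, continuing from the `cities` frame above; the column names (`TractFIPS`, `MeasureId`, `Data_Value`, etc.) are my recollection of the 500 Cities schema and should be verified against the actual file:

```python
# Pivot measures from long to wide: one row per census tract, one column per measure.
wide = cities.pivot_table(index="TractFIPS", columns="MeasureId", values="Data_Value").reset_index()

# Merge tract population and geolocation back in as ordinary columns.
tract_info = cities[["TractFIPS", "Population2010", "Geolocation"]].drop_duplicates("TractFIPS")
wide = wide.merge(tract_info, on="TractFIPS", how="left")

# Drop tracts with no measure data at all, then impute remaining gaps with the median.
measure_cols = [c for c in wide.columns if c not in ("TractFIPS", "Population2010", "Geolocation")]
wide = wide.dropna(subset=measure_cols, how="all")
wide[measure_cols] = wide[measure_cols].fillna(wide[measure_cols].median())
```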

The nonsensical numbers in the ACS dataset appeared to be consistent, indicating that they were likely truly missing values and not simply typos. All missing values were converted to NaN, and then all rows and columns that were missing more than 55% of their data were dropped. This percentage was chosen somewhat arbitrarily, but I wanted to make sure there was a decent amount of information about each variable before imputing missing data. For the same reasons as with the 500 Cities dataset, the missing values were imputed with the median.
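A sketch of this cleaning pass on the `acs` frame from the API pull above; the geography column names and the assumption that negative counts mark missing values are mine, not confirmed details of the source data:

```python
import numpy as np

# Build a combined tract FIPS from the API's geography columns and drop the identifier columns.
acs["tract_id"] = acs["state"] + acs["county"] + acs["tract"]
acs = acs.set_index("tract_id").drop(columns=["NAME", "state", "county", "tract"])
acs = acs.apply(pd.to_numeric, errors="coerce")

# The nonsensical values were consistent (e.g. negative counts), so mask them as missing.
acs = acs.mask(acs < 0)

# Drop rows and columns missing more than 55% of their values, then impute with the median.
acs = acs.dropna(axis=0, thresh=int(acs.shape[1] * 0.45))
acs = acs.dropna(axis=1, thresh=int(acs.shape[0] * 0.45))
acs = acs.fillna(acs.median())
```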

With all data cleaned or dropped, the 500 Cities and ACS datasets were merged with a right outer join on the unique tract IDs. The final dataset contained 26,969 observations and 407 variables. The full data wrangling process can be seen in [this notebook](https://github.com/TheeChris/springboard/blob/master/predicting_chronic_disease/chronic_disease_data_wrangling.ipynb).
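The final merge, continuing the sketches above; identifier names and dtypes are assumptions and must match between the two frames (e.g. both stored as zero-padded strings):

```python
# Right outer join on the unique tract ID: keep every tract present in the cleaned ACS frame.
combined = wide.merge(acs, left_on="TractFIPS", right_index=True, how="right")
print(combined.shape)  # roughly (26969, 407) per the counts above
```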