Skip to content

Using regression models to project food insecurity rates in the US, by county for the year 2020 (Flatiron Capstone Project)

Notifications You must be signed in to change notification settings

khyateed/projecting-food-insecurity

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Projecting Food Insecurity Rates in the US by County

Flatiron School Data Science Capstone Project
By Khyatee Desai
Jan. 6, 2021

Overview

Food insecurity (FI) is defined as the inability to consistently and reliably obtain enough food, due to a lack of resources. Food insecurity rates are one of the primary metrics used in determining how resources get distributed to communities through government assistance programs such as the Supplemental Nutrition Assistance Program (SNAP,) through non-profits organizations such as Feeding America, and through mutual-aid projects.

These organizations tend to prioritize resource allocation in the form of food, money, and financial relief to communities that demonstrate the greatest need in a quantifiable way. The goal of this project is to predict future food insecurity rates at the county level to aid in resource allocation and to better direct preventative measures to those areas before the issue gets worse.

Business Problem

Food insecurity, is not a phenomena that can be easily predicted, or even measure in real-time for two reasons: first, it is dependent upon a number of interwoven factors that can grow and change in unpredictable ways. Second, FI rate is a measure that is often determined retroactively, based on how food-assistance programs end up being utilized, and how survey respondents end up reporting their past food-related needs.

Currently, food insecurity rates for the past year are determined through the Current Population Survey (CPS,) which is a nationally representative survey conducted by the Census Bureau for the Bureau of Labor Statistics. In December of each year, 50,000 households respond to this survey, answering questions related to income, food spending, and the use of government and community food assistance programs, all of which are factored together to produce an annual food insecurity rate estimation.

While this annual survey provides incredibly valuable insight into the past needs of people at the community level, it does not inherently have the predictive capability to determine which communities will face the greatest impacts from food insecurity in the future, especially in the face of a worldwide pandemic that drastically impacts employment status, businesses, houselessness, and mobility.

Therefore, the goal of this capstone project is to aid in a proactive solution to food insecurity, by using regression models to project current FI rates using closely linked indicators such as houselessness, food cost, race, and employment status.

Data

The datasets used for the MVP model come from six different sources and are each broken down into yearly datasets spanning the years 2009-2020:

DATASET SOURCE DESCRIPTION
Food insecurity data Feeding America Map the Meal Gap Study This dataset contains data on food insecurity rates in the US by county, from 2009-2018.
Unemployment Data Bureau of Labor Statistics This dataset contains yearly data on the Labor Force of each US county, for the years 2009-2020. The files include data on total workforce, and unemployment rates.
Demographic Data United States Census Bureau County Population Estimates This dataset contains columns on demographic information such as gender, race, and age, for each US county, for years 2010-2019.
Houselessness Data US Dept. Housing & Urban Development (HUD) Point in Time Estimates This dataset contains data on houselessness rates in the US by Continuum of Care (CoC,) for the years 2009-2019.
Rent Prices Zillow Observed Rent Index This dataset contains monthly data from Zillow.com on 1-bedroom rent prices by zipcode, for the years 2014-2020. The data is produced using the Zillow Observed Rent Index (ZORI,) which is a smoothed measure of the typical observed market rate rent across a given region.
Food Business Data US Census Bureau County Business Patterns This dataset contains data on all businesses in the US at the County level, for years 2009-2018. The dataset is used below to get Food Retail data, which includes grocery stores, wholesalers, and restaurants.
Current Population Survey US Census Bureau CPS Datasets The CPS and Basic Monthly CPS surveys provide information on households such as median income, household size, disability status, and healthcare coverage.

Preprocessing

There are 55 datasets used to produce features for the MVP model, each of which must undergo intensive pre-processing before EDA and modeling can occur.

  • import each dataset, which contains a folder of CSV files for each year, and standardize any differences between the years
  • preliminary cleaning, such as renaming columns for interpretability and dropping features not needed for EDA or modeling, such as unique ID numbers or metrics that are irrelevant to this project.
  • map coded columns to their corresponding columns using data dictionaries
  • reformat GEOID's (also called FIPS codes) for each observation at the County level - FIPS code will be used as the primary/foreign key for each dataset to be joined together.
  • impute missing values in the Feeding America dataset for years 2011-2013 by using averaged change between 2010 and 2014 data.
  • replace special characters such as "--" with null
  • add a "Year" column to each yearly dataset, and then vertically concatenate all the yearly datasets together
  • horizontally concatenate all 6 dataframes into a main dataset for all years and all features

The following images depict the 2009 Businesses dataset before and after data cleaning.

Before cleaning: img

After cleaning: img

Data Dictionary

This table describes each of the main features of the final cleaned dataset (before feature engineering) which is used for EDA and modeling:

FEATURE DESCRIPTION DATA TYPE
FIPS County level identification code (also referred to as GEOID str
Rent Average 1-Br Rent price float
Year Year data was collected str
coc_number Continuum of Care (CoC) number - corresponds to houselessness data str
Houseless_rate Percent of population within the county that is houseless float
Sheltered_rate Percent of the population of a county that is houseless and resides in a shelter float
Unsheltered_rate Percent of the population of a county that is houseless and does not reside in a shelter float
State US State name str
County County name str
TOT_POP Population count within a county float
TOT_MALE Count of all males float
TOT_FEMALE Count of all females float
TOT_WHITE Count of all white people within a county float
TOT_BLACK count of all Black people within a county float
TOT_NATIVE count of all Indigenous people within a county float
TOT_ASIAN count of all Asian people within a county float
TOT_PACIFIC count of all Pacific Islander people within a county float
TOT_LATINX count of all LatinX people within a county float
State/County State/County combination str
FI Rate Percent of population within a county that is food insecure (target variable) float
Low Threshold Type Low threshold food assistance programs by State str
High Threshold Type High threshold food assistance programs by State str
Cost Per Meal Average cost per meal float
Num_wholesale Number of wholesale business within a county float
Num_restaraunts Number of restaurants and cafes within a county float
Num_grocery Number of grocery stores and markets within a county float
Total_workforce Total number of people within a county who are able to legally work float
Employed Total number of people within a county who are employed float
Unemployed Total number of people within a county who are unemployed float
Unemployment_rate Percent of the total workforce within a county that is unemployed float
hh_med_income Household median income float
pop_disabled Number of respondents who are disabled float
pop_hs_grad Number of respondents who graduated highschool float
pop_bachelors Number of respondents with a bachelors degree float
pop_grad_degree Number of respondents with a graduate degree float
pop_disabled Number of respondents who are disabled float
pop_disabled Number of respondents who are disabled float

For a detailed walkthrough of the cleaning process used to derive this cleaned dataset, please view data cleaning notebooks part one and part two.

Exploratory Data Analysis

This project focuses on projecting future FI rates, at the county level. This means that both time and geography are important components of understanding the data. The EDA notebook addresses 3 questions to help gain a better understanding of food insecurity, and how it relates to time, geography, and its closely linked indicators.

Question 1. How have metrics such as unemployment, houselessness, and food insecurity rates changed over time?

The first EDA question explores how different features from the original dataset have changed over time. Because this project ultimately aims to project food insecurity rates for 2020, it is important to get an understanding of how features change over time, and whether they follow any discernible trends.

Each feature group is scaled to the same magnitude, and visualized using lineplots or/and barcharts. The construction of these multi-variable charts is accomplished using two functions: lineplot() and barchart(), both of which can be found in the src folder of this repository.

Food Insecurity, Houselessness, and Unemployment Rates, by Year

The charts below aim to visualize and compare the trends between these three scaled features over time:

img

The lineplots above indicate there is a strong visible correlation between FI rate (target variable,) unemployment, and houseless rates. This is important to keep in mind, moving into modeling and inspecting feature importance. Although the data is not available for the MVP, the strong relationship between these three features is also a strong indicator that FI rates and houselessness will drastically increase in 2020, alongside unemployment rate.

Food Insecurity in Different Communities, by Year

The following lineplots aim to look at how average food insecurity rates differ across different racial communities over time:

img

The above lineplots indicate that average FI rates are highest in Black communities, followed by Indigenous and LatinX. While the general trend is downwards, some communities encounter anomalous movement, such as predominantly LatinX communities, which face an upward spike in FI Rates moving into 2018. It is important to note that these rates are not directly tied to racial groups, but rather counties with different predominant racial demographics.

Average Unemployment Rates in Different Communities, by Year

The following plots inspect average unemployment rates in different racial communities, by year.

img

The above lineplots indicate that predominantly Black and Indigenous communities face higher average rates of unemployment, followed by LatinX communities. The general trend of average unemployment rate is downwards over time.

Question 2. How do unemployment, race, and food insecurity rates vary geographically?

The purpose of exploring this question is to gain a better understanding of the geographic component of food insecurity and related features.
This question was explored by producing several chroropleth maps, which are heatmaps that visualize a particular feature across a geographic area. The maps are generated using a function called choropleth() which takes a dataframe, feature, year, color palette, and title, and generates a choropleth map.

Unemployment Rate, by County, 2019 and 2020

img img

Unemployment rates were at a record low in 2019 before the pandemic started, and shot up significantly in 2020, as can be seen by the maps above, which show a large increase in unemployment rates across the country, and especially on the coasts.

PoC Population Percentage by County, 2010 and 2020

These maps help to visualize the areas of the United States whose populations are predominantly non-white.

img img

The above maps indicate that while the majority of US counties are predominantly white, there are significant areas, such as much of the southern states and portions of the Southwest that are predominantly communities of Color. There does not seem to be a significant change in this demographic spread between 2010 and 2018.

Food Insecurity Rates by County, 2009 and 2018

The following maps visualize FI Rates by county in 2009 and 2018.

img img

The maps above indicate lower average food insecurity rates over time in some areas, such as the Pacific Northwest, and significantly higher rates in Southern counties. This is important to note, given that while national averages for FI rates have gone down over time, these maps indicate that in some communities, this problem has actually been exacerbated over time.

When compared to the maps above which visualize communities of color, it is easy to see that the counties which are most severely effected by food insecurity very closely parallel the areas of the country that are predominantly non-white.

Question 3. What is the relationship between food insecurity rate and other factors such as rent prices, unemployment, houselessness, and race?

The purpose of this final EDA question is to determine how the other features in the dataset, such as average rent prices and houselessness rates, directly relate to food insecurity rates. These observations will be important to keep in mind moving into the modeling process and looking at coefficients/feature importances.

FI Rate Distribution, by Rent Prices

The below histograms show the distribution of FI rates in areas with 1-br apartment rent prices above $2000, vs areas with rent prices below $1000.

img

The above histograms indicate a visually significant difference between the two. Areas with lower rent prices have generally higher FI Rates with a mean around 0.14, while areas with higher rent prices have lower FI Rates with a mean around 0.08. This makes sense, given that "wealthier" areas likely less from food insecurity.

FI Rate Distributions, by Unemployment Rate

The following histograms plot FI rate distribution, by areas with unemployment rates above %15, vs areas with unemployment rates below %5.

img

The above histograms show a visually significant difference in FI rates between areas based on unemployment rates. Areas with high unemployment rates have higher average FI Rates with a mean around 0.23, while areas with low unemployment rates have lower average FI rates averaging around 0.11. This is pretty intuitive, given that being unemployed is a likely cause of being unable to afford food.

FI Rate Distribution, by Houseless Rates

The below histograms plot food insecurity distributions of areas with a houseless rate above %1, and areas with a houseless rate below %0.01.

img

The above histograms indicate a visually significant difference between the two distributions. Areas with higher houselessness rates have a higher average FI Rate of around 0.16, versus areas with low houselessness averaging around 0.08. This makes sense, given that being houseless possibly indicates a lack of money, which lacks to an inability to buy enough food.

FI Rate Distribution, by Race

The following histograms plot food insecurity rate distributions for areas of predominantly different races.

img

The above histograms indicate a visually significant different in FI rates for different racial communities. Areas that are predominantly Black are shown to have a much higher average FI rate of 0.29, versus predominantly white or LatinX communities with averages around 0.14. This is important to keep in mind, when considering feature importance and factors that are significant indicators for food insecurity.

Data Analysis Summary

This data analysis explores the above questions through the lens of time and geography. From the analysis, we see close relationships between FI Rates, unemployment rates, houseless rates, and race, both in terms of which areas are geographically impacted most, as well as how these different features have changed over time.

Communities of color, particularly Black communities, as well as communities with high unemployment and houselessness are shown through this analysis to have the highest FI rates. In addition, while many features such as FI rate, unemployment, and houselessness have decreased over time on average, the above choropleth maps indicate that these factors have actually been exacerbated in certain geographic areas over time.

The findings of this data exploration are meaningful when determining which communities and geographic areas are most at-risk for high food insecurity rates, and should be used both in terms of allocating resources to these communities, as well as taking proactive measures to address the root cause of these issues that disproportionately affect certain communities over others.

Feature Engineering

Before beginning modeling, several new features are derived from the original dataset features in the feature_engineering.ipynb notebook. These engineered features are saved using Pickle, and imported into the modeling notebook later on.

Demographics Percentage Breakdowns

Demographic percentages make it possible to compare the demographic distribution of different counties, while accounting for population density differences.
Ex. df['Percent_male'] = df['TOT_MALE']/df['TOT_POP']

Interaction Features

An interaction feature is created for each combination of continuous features, and the "best" performing features are added to the main dataframe, by running them in a simple OLS model and comparing R2 values. Interaction Features are derived using the following calculation:
df[feature1+'_X_'+feature2] = df[feature1] * df[feature2]
Interactions help to account for coexisting features - for example, someone of identity A and identity B may have a much greater chance of being food insecure than someone of only identity A or identity B

Log Transformations

Log features are created by taking the natural log of a feature, and adding this new feature to the dataframe. Log transformations can be useful to better model the shape of data that has very high outliers, by penalizing high values more than smaller ones.

img img

The above images show scatter plots of Rent price vs. FI Rate, and Log Rent price vs. FI Rate. The log transformation penalizes the outliers present within the rent data, allowing the shape of the data to be better interpreted.

Dummy Variables

Dummy variables are created for the categorical variables High and Low threshold, by using the Pandas get_dummies() function to turn them into 1's and 0's.

Feature Selection and Modeling

The modeling process uses final cleaned data with engineered features, produced and pickled from feature_engineering.ipynb. Each model produced for the MVP is a simple linear regression, using different features determined through a variety of feature selection techniques.

Model 1: All Features

The first model uses all features in the cleaned dataset, as well as all engineered features.

R2 RMSE
Train 0.652 0.0258
Test 0.638 0.0260

Model 1 had a cross validation R2 score of 0.587, with 5 folds.

Model 2: Remove Outliers

The EDA notebook includes a section on inspecting which features have the largest outliers, using box and whisker plots. The following plot focuses on the features with highest outliers: 'TOT_POP', 'TOT_MALE', 'TOT_FEMALE', 'TOT_BLACK', 'TOT_ASIAN','TOT_WHITE', 'TOT_LATINX','Total_workforce', and 'Employed'

img

For model 2, each feature with visibly high outliers is iterated on, and if the feature has observations greater than +/- 2 standard deviations of the mean, it reduces outliers to +/- 2 standard deviations from the mean of that feature.
The model is then re-run using the same process as model 1, now on data with reduced outliers:

R2 RMSE
Train 0.760 0.0214
Test 0.757 0.0215

Model 2 had slightly better R2 scores, but lower RMSE, and more importantly, a lower cross validation R2 score of 0.503, with 5 folds.

Model 3: Remove Multicollinear Features

Multicollinearity occurs when features in the dataset are correlated to one another, rather than only to the target variable. Multicollinearity can create noise within the data, and is therefore important to address during the feature selection process. A heatmap is produced from correlation matrix in the EDA notebook, which highlights that houseless rate, sheltered/unsheltered rates, and all of the race-related features are highly correlated with one another. img

For model 3, Variance inflation factor (VIF) is used to remove highly correlated features. VIF helps quantify the severity of multicollinearity in a dataset by comparing variance of the model using different terms. High VIF scores indicate severe multicollinearity, so model 3 uses only features with VIF scores below 10.

R2 RMSE
Train 0.450 0.032
Test 0.473 0.0318

Model 3 had a cross validation R2 score of 0.385, and the worst scores for both R2 and RMSE.

Model 4: K-Best Features

Model 4 uses SelectKBest() to determine the best k features to use in a model. A variety of k values are looped through to determine the k that yields the best performing model. The best k value is determined to be 93, so model 4 is run with those features.

R2 RMSE
Train 0.791 0.0199
Test 0.791 0.0200

Model 4 performs significantly better than others, with an average R-squared value of .791, an RMSE of .0199, and an average cross validation score of 0.743.

Model 5: Recursive Feature Elimination

Recursive feature elimination is used to iteratively remove features and rerun the model, in order to find an optimal set of features to use.

R2 RMSE
Train 0.793 0.0199
Test 0.787 0.0198

Model 5 performs slightly better than the Select K Best model, with an average R-squared value of .790, an RMSE of .0198, and an average cross validation score of 0.749.

Model Evaluation

Model 5, which used Recursive Feature Elimination was found to be the best MVP model. Below, we inspect model features and coefficients, and use the model to make predictions:

Coefficients

View the ten highest coefficients alongside their corresponding features :
[(25148.43406913098, 'Unsheltered_rate'),
(25128.704386069712, 'Houseless_rate_X_Sheltered_rate'),
(25121.82280037445, 'Houseless_rate_X_Percent_male'),
(25102.09215732087, 'Sheltered_rate'),
(12589.751692018015, 'Percent_Black_X_Percent_working'),
(12563.140386645082, 'Sheltered_rate_X_Percent_male'),
(12558.682413724084, 'Unsheltered_rate_X_Percent_male'),
(12538.952694053738, 'Houseless_rate_X_Percent_female'),
(118.73999449300752, 'Houseless_rate_X_Percent_working'),
(84.2136249355529, 'Unsheltered_rate_X_Percent_asian')]

From the output, we see that houseless rates intersected with race are the highest predictors for food insecurity, which is not surprising after exploring the relationships between race, houselessness, and FI rates during the EDA process.

Predictions

Model 5 is used to make predictions on 2018 data, which is mapped below alongside actual 2018 FI rates:

img img

The maps above indicate that the model was able to capture the general trends of food insecurity, especially in places which are the most significantly impacted. The model did not perform as well at capturing the severity of FI rates in certain areas, such as Maine, the Pacific Northwest, and the Southeast.
Note: Empty spaces indicate missing values due to a lack of data for one or more of the modeling features, for that county in the year 2018.

Conclusion

The final RFE Model 5 was able to explain about 75% of the variance in the data, based on a cross validation R2 score of 0.7486, and was off on predictions by an average of 2%, based on Test RMSE score of 0.0198. The most important features used in this model were shown to be Unsheltered_rate_X_Percent_asian, Sheltered_rate_X_Percent_white, Sheltered_rate_X_Percent_Black, Houseless_rate_X_Percent_asian, and Unsheltered_rate_X_Percent_male, based on model coefficients.
This indicates that both houselessness and race play a critical role in determining the likelihood of food insecurity, but especially so when these features interact (ie. someone who is both Black and houseless.)

The choropleth maps on 2018 FI Rate predictions indicate that the model was able to capture the general trend of food insecurity, particularly in places that are most impacted.

Next Steps

There is a lot of room to improve the model's ability to explain even more variance in the data, potentially by adding more features such as household income, food assistance programs, age, disability, eviction data, and transportation access.

The next iteration of this project aims to accomplish the following:

  • include more features
  • utilize statistical testing during EDA
  • employ different types of regression models
  • create an interactive map to visualize features and trends
  • project FI rates on unlabeled data from 2020

More Information

To explore the full project, please view the Jupyter Notebook files within the notebooks/ folder, and the presentation PDF document in the root of this repository.

For any additional questions, please contact Khyatee Desai - khyatee.d@gmail.com

Repository Structure

├── README.md
├── src
│   ├── functions.py		<- Functions used in EDA notebook 
├── notebooks
│   ├── cleaning_pt1.ipynb		<- Preliminary data collection and cleaning notebook
│   ├── cleaning_pt2.ipynb		<- Final data cleaning process
│   ├── EDA.ipynb			<- Data anlysis & visualization notebook
│   ├── feature_engineering.ipynb
│   ├── modeling_process.ipynb        <- Feature selection and modeling
│   ├── predictons.ipynb        <- Model evaluation and predictions
├── datasets				<- Directory of all datasets used 
│   ├── businesses
|	├── cps
|	├── demographics
|	├── feeding_america
|	├── household
|	├── houseless
|	├── income 
|	├── rent_prices
|	├── unemployment
|	├── shapefile
├── images				<- All images produced from EDA
├── pickled				<- Cleaned datasets and final model
│   ├── fully_cleaned_data.pickle
│   ├── feature_engineered_data.pickle
│   ├── rfe_features.pickle
└   ├── random_forest_model.pickle

About

Using regression models to project food insecurity rates in the US, by county for the year 2020 (Flatiron Capstone Project)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published