Ames, Iowa Housing Data

Real Estate Statistical Modeling with Feature Engineering


Problem Statement:

When appraising an individual property, the quality and quantity of many physical attributes must be considered for accuracy. Typical considerations include, but are not limited to, location, size, and the ratio of bedrooms to bathrooms. In this project, supervised machine learning is used to predict the sale prices of more than 800 homes in Ames, Iowa, sold from 2006 to 2010. The results show that the baseline null model can be improved upon by including carefully chosen explanatory features. Furthermore, the model produces data-driven insights that can inform decisions to optimize ROI on the sale of, or investment in, individual properties.

Exploration of the following specific questions:

  • Are location and size the most important factors in predicting the selling price of individual real estate?
  • What data-driven decisions can property owners make to maximize ROI?

Table of Contents


EDA & Data Cleaning

Datasets

Raw Training Dataset

2,051 homes/properties: ~70%

Raw Validation Dataset

878 homes/properties: ~30%

Processed Datasets:

  • clean_train.csv: Subset of train.csv after cleaning in Notebook 01, saved for use in all notebooks.
  • cat_select_test_m4.csv: Subset of data from categorical features included in Model 4, saved for Kaggle submission.
  • cat_select_train.csv: Subset of data from categorical features included in Model 4, saved for Kaggle submission.

Notes about the data:

  • 80 Explanatory Variables:
    • 34 Numerical:
      • 14 Discrete
        • General Description: Number of types of items present, or dates of construction/remodeling.
        • NOTE: Consider whether values are better represented as categorical or continuous.
      • 20 Continuous
        • General Description: Various area dimensions of each property.
    • 46 Categorical:
      • 23 Ordinal
        • General Description: Various ratings of items specific to each property.
      • 23 Nominal
        • General Description: Various types of items/materials/conditions.
  • Time was spent searching for null values and dropping outliers. The simplest approach turned out to be a first model using only integer features.

    Feature Selection Methodology

    1. Referring to the Data Library for definitions of sub-categorical features, we used best judgment to choose features for an initial prediction that beats the baseline null model.
    2. Successive feature selection was conducted iteratively, adjusting parameters after visualizing feature distributions with box and violin plots in Notebook 02.
    3. Using feedback from statsmodels in Notebook 03, I interpreted p-values to improve the RMSE computed in Notebook 04.
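The baseline null model and the RMSE used to judge each iteration can be sketched as follows. This is a minimal illustration in plain Python, not the notebooks' code: the prices are made-up values and the `rmse` helper is a hypothetical name.

```python
import math
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(mean(squared_errors))


# Hypothetical sale prices in dollars (illustrative, not from the dataset).
train_prices = [120_000, 150_000, 180_000, 210_000, 340_000]
holdout_prices = [130_000, 200_000, 260_000]

# Baseline null model: always predict the mean of the training target.
null_prediction = mean(train_prices)
baseline_rmse = rmse(holdout_prices, [null_prediction] * len(holdout_prices))

print(f"null prediction: {null_prediction:,.0f}")
print(f"baseline RMSE:   {baseline_rmse:,.0f}")
```

Any candidate feature set is worth keeping only if its RMSE on held-out data falls below this baseline value.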

Distribution of the Housing Sale Price

Figure 1: Distribution of the Target:

Left: Sale Price, Right: Log-Level Transformation of Sale Price
NOTICE: The distribution of Sale Price approaches normal following the transformation. Q: How much does this improve the model?
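The effect of the log transformation on skew can be checked numerically. The sketch below uses synthetic log-normal values as a stand-in for Sale Price (an assumption, since the real data are not reproduced here), and `skewness` is a hypothetical helper.

```python
import math
import random
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])


random.seed(0)
# Synthetic right-skewed "prices" standing in for Sale Price.
prices = [random.lognormvariate(12, 0.4) for _ in range(2000)]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skew of raw prices: {raw_skew:.2f}")
print(f"skew of log prices: {log_skew:.2f}")
```

If a model is fit on the log of the target, its predictions need to be passed through `math.exp` before computing errors in dollars.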

EDA

  • Missing Values:
    • Garage Area
  • Identify Outliers:
  • Are relationships to the target linear?

Data Cleaning

  • Null Value Imputation:
  • Manage Outliers:
  • Combining Features:
  • Interaction Terms:
  • Do you want to manually drop collinear features?


Pre-Processing and Feature Engineering

Back to Top

Distribution of the Housing Sale Price


Figure 2: Sale Price vs. Above-Grade Living Area (Gr Liv Area)

Neighborhoods are grouped into classes and shown by hue: 0 (blue), 1 (orange), or 2 (green).

Class 2: The mean Sale Price of all sales falls below the 1st quartile of Class 2 (the highest-priced neighborhoods).
Class 1: The mean Sale Price of all sales falls above the 3rd quartile of Class 1 (the lowest-priced neighborhoods).
Class 0: The mean Sale Price of all sales is close to the 2nd quartile of Class 0 ("other neighborhoods").
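The neighborhood grouping can be expressed as a simple lookup. The class memberships below are copied from the data dictionary in the appendix; the function name `neighborhood_class` is hypothetical.

```python
# Class memberships as listed in the appendix data dictionary.
HIGH_PRICE = {"StoneBr", "NridgHt", "NoRidge"}  # class 2
LOW_PRICE = {
    "Sawyer", "BrDale", "IDOTRR", "MeadowV", "SWISU",
    "BrkSide", "NPkVill", "Blueste", "Landmrk",
}  # class 1


def neighborhood_class(name: str) -> int:
    """Map a raw neighborhood code to its price class (0, 1, or 2)."""
    if name in HIGH_PRICE:
        return 2
    if name in LOW_PRICE:
        return 1
    return 0  # all other neighborhoods


for code in ("StoneBr", "Sawyer", "NAmes"):
    print(code, "->", neighborhood_class(code))
```

Collapsing the neighborhoods into three classes keeps location in the model as a single integer feature instead of one dummy column per neighborhood.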

Pre-processing

  • Set-up Models > Unsupervised Clustering
  • One-hot encode categorical variables
  • Train/test split the data
  • Scale the data
  • Consider using automated feature selection
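The encode, split, and scale steps above can be sketched in plain Python so that the example runs without dependencies. The rows, column choices, and split ratio are assumptions for illustration; the notebooks may well use pandas and scikit-learn equivalents instead.

```python
import random
from statistics import mean, pstdev

# Hypothetical rows: (above-grade living area in sq ft, foundation type).
rows = [
    (1200, "PConc"), (1500, "CBlock"), (900, "Slab"), (2100, "PConc"),
    (1750, "CBlock"), (1320, "PConc"), (1010, "Slab"), (1890, "PConc"),
]

# 1. One-hot encode the categorical column, dropping the first level
#    to avoid perfect collinearity with the intercept.
levels = sorted({foundation for _, foundation in rows})
kept_levels = levels[1:]
encoded = [
    [area] + [1 if foundation == level else 0 for level in kept_levels]
    for area, foundation in rows
]

# 2. Train/test split (75/25) with a fixed seed for reproducibility.
rng = random.Random(42)
indices = list(range(len(encoded)))
rng.shuffle(indices)
cut = int(len(indices) * 0.75)
train = [encoded[i] for i in indices[:cut]]
test = [encoded[i] for i in indices[cut:]]

# 3. Standardize the numeric column using training statistics only,
#    so nothing from the test rows leaks into the scaler.
train_areas = [row[0] for row in train]
mu, sigma = mean(train_areas), pstdev(train_areas)


def scale(split):
    """Return a copy of the split with column 0 standardized."""
    return [[(row[0] - mu) / sigma] + row[1:] for row in split]


train_scaled, test_scaled = scale(train), scale(test)
print(train_scaled[0], len(train_scaled), len(test_scaled))
```

Fitting the scaler on the training rows alone is the design choice that matters here: the test rows are transformed with the same mean and standard deviation but never contribute to them.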


Model Benchmarks and Preparation

Back to Top


Model Tuning & Assessment

Back to Top

Various metrics, such as MSE, MAE, and adjusted R-Squared, are displayed for evaluation of four linear regression models. The "regression_metrics()" function streamlines the performance evaluation of each model included for comparison.
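The body of `regression_metrics()` is not shown in this README. A hypothetical sketch of such a helper, with an assumed signature and assumed return format, might look like this:

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return common regression metrics as a dict.

    Hypothetical sketch: the signature and the returned keys are
    assumptions, not the project's actual implementation.
    """
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    y_bar = sum(y_true) / n
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)

    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes each additional feature.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(r) for r in residuals) / n,
        "R2": r2,
        "Adj R2": adj_r2,
    }


# Made-up values, in thousands of dollars.
metrics = regression_metrics(
    [100, 150, 200, 250, 300], [110, 140, 205, 245, 310], n_features=2
)
for name, value in metrics.items():
    print(f"{name:>7}: {value:.3f}")
```

Returning one dict per model makes it easy to line the four models up side by side for comparison.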


Production Model & Insights

Back to Top

Pave your driveway! There are many things to consider when selling your home, and yes, size plays a role in the final Sale Price, but much can be done to improve the value of a home regardless of size. Installing A/C and investing in quality insulation and heating improve the expected worth of homes in this dataset. Since the sample data were representative of the housing population in Ames at the time, property owners would have benefited from making these improvements before selling.


Recommendations and Next Steps

Back to Top


Kaggle Submissions

Back to Top


Software Requirements

Back to Top


Acknowledgements and Contact:

Back to Top

External Resources:

  • [Journal of Statistics Education] (Taylor & Francis Online): (source)
  • [Feature Engineering and Selection: A Practical Approach for Predictive Models] (Online Text): (source)

Papers:

  • Ames, Iowa: Alternative to the Boston Housing Data (Journal of Statistics Education): (source)

Contact:


Appendix: Data Dictionary

Back to Top

| Feature | Type | Dataset | Category | Description |
|---|---|---|---|---|
| Bldg Type (C) | integer | 2006-10 Ames, Iowa Assessor's Office | '1Fam': 0, 'TwnhsE': 0, 'Twnhs': 1, '2fmCon': 1, 'Duplex': 1 | Type of building |
| Central Air (C) | integer | 2006-10 Ames, Iowa Assessor's Office | 'Y': 0, 'N': 1 | Central Air-Conditioning |
| Condition 1 (C) | integer | 2006-10 Ames, Iowa Assessor's Office | 0: Other, 1: Near/adjacent to major streets/railroads, 2: Near/adjacent to positive off-site feature | Proximity to various conditions |
| Exterior 1st (C) | integer | 2006-10 Ames, Iowa Assessor's Office | Asbestos Shingles: 1, else: 0 | Exterior covering on house |
| Exter Qual (C) | integer | 2006-10 Ames, Iowa Assessor's Office | Good/Typical: 0, Fair: 1, Excellent: 2 | Quality of material on exterior |
| Foundation (C) | integer | 2006-10 Ames, Iowa Assessor's Office | CinderBlock/Stone/Wood: 0, Brick/Slab: 1, PouredConcrete: 2 | Type of Foundation |
| Land Contour (C) | integer | 2006-10 Ames, Iowa Assessor's Office | Banked (quick and significant rise from street grade to building): 1, Other: 0 | Flatness of the property |
| Lot Config (C) | integer | 2006-10 Ames, Iowa Assessor's Office | Cul-de-sac: 1, Other: 0 | Lot Configuration |
| Lot Shape (C) | integer | 2006-10 Ames, Iowa Assessor's Office | Most Irregular: 1, Otherwise: 0 | General shape of property |
| 1st Flr SF | integer | 2006-10 Ames, Iowa Assessor's Office | Continuous | First Floor Square-Feet |
| Garage Area | integer | 2006-10 Ames, Iowa Assessor's Office | Continuous | Size of garage in Square-Feet |
| Gr Liv Area | integer | 2006-10 Ames, Iowa Assessor's Office | Continuous | Above grade living area in Square-Feet |
| Heating QC (C) | integer | 2006-10 Ames, Iowa Assessor's Office | _1:No 2:Yes | Heating Quality & Condition |
| Neighborhood (C) | integer | 2006-10 Ames, Iowa Assessor's Office | 2: StoneBr, NridgHt, NoRidge; 1: Sawyer, BrDale, IDOTRR, MeadowV, SWISU, BrkSide, NPkVill, Blueste, Landmrk; 0: All others | Physical locations within the Ames city limits, classified by sample average 'Sale Price'. Class 0 average Sale Prices were within 1 standard deviation of the sample average; Class 2 were above, Class 1 below. |
| Overall Qual | integer | 2006-10 Ames, Iowa Assessor's Office | Ordinal: 10 (Very Excellent) to 1 (Very Poor) | Overall quality of material and finish |
| Paved Drive (C) | integer | 2006-10 Ames, Iowa Assessor's Office | 1: No/Partial, 0: Yes | Paved driveway |
| Total Bsmt SF | integer | 2006-10 Ames, Iowa Assessor's Office | Continuous | Total Square-Feet of Basement Area |
| Year Built | integer | 2006-10 Ames, Iowa Assessor's Office | Discrete, range: 1872 - 2010 | Original Construction Date |
| Year Remod/Add | integer | 2006-10 Ames, Iowa Assessor's Office | Discrete, range: 1950 - 2010 | Remodel date, or construction date if no remodel or additions |
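The integer encodings in the data dictionary can be applied with plain lookup tables. The sketch below covers only the two features whose raw codes are listed above; the `encode_record` helper and the sample record are hypothetical.

```python
# Encodings copied from the data dictionary above.
ENCODINGS = {
    "Bldg Type": {"1Fam": 0, "TwnhsE": 0, "Twnhs": 1, "2fmCon": 1, "Duplex": 1},
    "Central Air": {"Y": 0, "N": 1},
}


def encode_record(record):
    """Replace raw category codes with their integer encodings.

    Columns without an entry in ENCODINGS pass through unchanged.
    """
    return {
        column: ENCODINGS[column][value] if column in ENCODINGS else value
        for column, value in record.items()
    }


# Made-up record for illustration.
raw = {"Bldg Type": "Duplex", "Central Air": "Y", "Gr Liv Area": 1500}
encoded_record = encode_record(raw)
print(encoded_record)
```

An unknown category raises a `KeyError` in this sketch, which surfaces unexpected values early instead of silently mapping them to a default.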

Back to Top