King County Housing - Pricing Predictions

Overview

Repository Navigation

ReadMe Navigation

Business Understanding
Data Understanding
Feature Engineering
Model & Prediction
Conclusions
Further Steps
Project Info

Business Understanding

The context of this project is a boutique real estate firm opening in King County Washington, seeking to make a move into the market by gaining increasing their consumers. In order to do this, they've tasked their data scientists with creating a model which will predict the estimated housing price for potential clients. The desired output of this project is a simplified system which the UX team can implement on their website that will allow users to input a limited amount of known factors about their home, and output an estimated price range. They're hoping that the efficiency, accuracy, and speed of return for clients considering selling their home will 1) entice consumers towards the efficiency of their business, and 2) enable their real estate associates to quickly engage with potential clients based on model estimates. So our guiding questions were:

Which 5 are the top most influencing factors upon home price?
Given data about these factors from the potential client's home, what will be the predicted selling price of their home?

Data Understanding

The data given for this project came from housing sales data from King County Washington for houses sold between May 2014 and May 2015. Combining the information revealed from the dataset with industry relevant knowledge revealed a few categories of features which we thought would have an impact upon the sales price of the house. The sets we used, specifically included

Overall, the data provided included the sales price and other information for around 21,000 houses sold between May 2014 and My 2015.

An overall map of the data:

The distribution of prices looked like this:

Data Categories

Location, location, location

What's the impact on the selling price if the home is in a top school district, in the Seattle metro area, or on the waterfront?

The size of the house in the fight...?

What size datapoints most impact the sales price of the house? Bedrooms, bathrooms, floors, basement space, etc?

Quality: I'll huff, and I'll puff, and...

...will your house be blown down? How effective can King County condition and grade values be in predicting the value of your home?

Data Limitations

Date

The sales data we used was from Maay 2014 - May 2015. This would prevent us from being able to see any longer term trends in the change of housing price between years. Additionally, this analysis did not account for inflation. Finally, we would have liked to have more recent data to assess any pattern or trends in sales price impact by current events (pandemics, protests, politics).

Model Simplicity

The model goes for simplicity over predictive accuracy - so the model does not explain every variationin price. The business purpose is to enable people with limited information to get a general range, and allow the real estate firm to do the rest in terms of specifics.

Price Range

We limited the data for the model to houses above worth more than 100,000 USD and less than 1,000,000. Given the business being boutique and new in the market, the business goal was refined to be worth initial investments, but not beyond the expertise of associates new to the Seattle scene.

Data Assumptions

Additionally, based on the business understanding and industry knowledge, we made several assumptions about the data in order to qualify the predictive accuracy of the model, including

Renovations were total

We assumed that homes marked as renovated by King County were total renovations - not just partial (e.g., kitchen, bathroom, etc.).

Access to location data

We assumed that the potential client would know the location of their home.

Feature Engineering

Given the categories of data listed above, we selected certain factors within the dataset, and created or found new features that would help to create a more accurate pricing model. The new features were:

Defining top school districts
Creating a "season sold" factor
Reshaping categorical values from continuous variables (e.g., grade/condition)
Combining features, like ratio of basement space to living space
Creating a user input function

Model & Prediction

We used the Ordinary Least Squares Regression to create a model which would help determine the most impactful factors and help us more accurately predict the prices.

Test Assumptions (Linearity, Multicollinearity)

In order for this model to work the most effectively, we evaluated it for linearity between factors, and multicollinearity between pairs of factors, which gave us the following:

Iterate and Evaluate Models

After iterating through, and making adjustments to, and evaluating 3 regression models, we adopted the final one using the highest R-squared value as a measure of performance. With more time we could have further refined the model to account for what appears to be an exponential function from the residuals plot:

Predict Values with Example Inputs

Given the model we produced, we were able to create functions that would take the user inputs and output a prediction. In this example, the potential client would input the numbers after the colon:

Conclusions

The most predictive factors in home price were:

Location

Price would increase by 180k USD for top 5 school district, by 210k for homes outside the city, and 480k for homes on the waterfront

Home Size

Basement square footage is worth less than square footage above ground, and too many bedrooms could lower the value of the house.

Quality

Homes with high grades (good architecture and build quality) are worth significantly more than lower grades, and homes in a very good condition have a relatively significant impact over just average homes.

When to Sell

Homes sold in spring/summer sold significantly more than those sold in fall/winter - presumably from that gloomy Seattle rain :-P

Further Steps

Internal to Data Science Team

With less of a time constraint, we would:

Fix normality issues in model.
Fix heteroskedasticity issues in model.
Fix multicollinearity issues in model
Test effectiveness of model using test data.
Explore industry and create new features
Deploy consumer product

For Real Estate Team

Given additional resources, we would recommend the following to the firm:

Collect additional relevant data around other factors, using expertise
Open up to home prices above 1 million
Provide more time-relevant data

Project Info

Contributors: Alexander Newton, Jim Fay

Languages : Python

Tools/IDE : Git, Powershell (Windows), Anaconda, Jupyter Notebook, Google Slides

Libraries : numpy, pandas, matplotlib, seaborn, statsmodels, scipy, geopandas, descartes, shapely

Duration : July 2020 Last Update: 07.10.2020

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
.ipynb_checkpoints		.ipynb_checkpoints
Data		Data
Images		Images
Notebooks		Notebooks
Presentation		Presentation
README.ipynb		README.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

King County Housing - Pricing Predictions

Overview

Repository Navigation

ReadMe Navigation

Business Understanding

Data Understanding

Data Categories

Location, location, location

The size of the house in the fight...?

Quality: I'll huff, and I'll puff, and...

Data Limitations

Date

Model Simplicity

Price Range

Data Assumptions

Renovations were total

Access to location data

Feature Engineering

Model & Prediction

Test Assumptions (Linearity, Multicollinearity)

Iterate and Evaluate Models

Predict Values with Example Inputs

Conclusions

Location

Home Size

Quality

When to Sell

Further Steps

Internal to Data Science Team

For Real Estate Team

Project Info

About

Releases

Packages

Contributors 2

Languages

anewt225/mod_2_project

Folders and files

Latest commit

History

Repository files navigation

King County Housing - Pricing Predictions

Overview

Repository Navigation

ReadMe Navigation

Business Understanding

Data Understanding

Data Categories

Location, location, location

The size of the house in the fight...?

Quality: I'll huff, and I'll puff, and...

Data Limitations

Date

Model Simplicity

Price Range

Data Assumptions

Renovations were total

Access to location data

Feature Engineering

Model & Prediction

Test Assumptions (Linearity, Multicollinearity)

Iterate and Evaluate Models

Predict Values with Example Inputs

Conclusions

Location

Home Size

Quality

When to Sell

Further Steps

Internal to Data Science Team

For Real Estate Team

Project Info

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages