- Notebooks: Summary Notebook
- Business Understanding
- Data Understanding
- Feature Engineering
- Model & Prediction
- Conclusions
- Further Steps
- Project Info
The context of this project is a boutique real estate firm opening in King County, Washington, seeking to break into the market by growing its customer base. To do this, the firm has tasked its data scientists with creating a model that predicts the estimated housing price for potential clients. The desired output of this project is a simplified system that the UX team can implement on the firm's website, allowing users to input a limited number of known factors about their home and receive an estimated price range. The firm hopes that the efficiency, accuracy, and speed of return for clients considering selling their home will 1) draw consumers to the business, and 2) enable its real estate associates to quickly engage with potential clients based on model estimates. Our guiding questions were:
- Which five factors most influence home price?
- Given data about these factors from the potential client's home, what will be the predicted selling price of their home?
The data for this project came from housing sales in King County, Washington, for houses sold between May 2014 and May 2015. Combining the information in the dataset with industry knowledge revealed a few categories of features which we thought would affect the sales price of a house. The data sets we used included:
- King County Government House Data
- King County GIS Open Data Platform
- School Performance Data from Background Checks.org
Overall, the data provided included the sales price and other information for around 21,000 houses sold between May 2014 and May 2015.
The distribution of prices looked like this:
What's the impact on the selling price if the home is in a top school district, in the Seattle metro area, or on the waterfront?
Which size-related data points most affect the sales price of the house: bedrooms, bathrooms, floors, basement space, etc.?
...will your house be blown down? How effective can King County condition and grade values be in predicting the value of your home?
The sales data we used covered May 2014 to May 2015. This prevented us from seeing any longer-term trends in housing prices across years. Additionally, this analysis did not account for inflation. Finally, we would have liked more recent data to assess any patterns or trends in how current events (pandemics, protests, politics) affect sales price.
The model favors simplicity over predictive accuracy, so it does not explain every variation in price. The business purpose is to enable people with limited information to get a general range, and allow the real estate firm to do the rest in terms of specifics.
We limited the data for the model to houses worth more than 100,000 USD and less than 1,000,000 USD. Because the firm is boutique and new to the market, the business goal was refined to focus on homes that are worth the initial investment but not beyond the expertise of associates new to the Seattle scene.
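As a minimal sketch of the price filter described above, assuming the sales data is loaded into a pandas DataFrame with a column named `price` (the column name and the sample values are hypothetical, not taken from the project notebooks):

```python
import pandas as pd

# Bounds from the text: strictly more than 100,000 and less than 1,000,000 USD.
MIN_PRICE = 100_000
MAX_PRICE = 1_000_000


def filter_price_range(sales: pd.DataFrame, price_col: str = "price") -> pd.DataFrame:
    """Keep only rows whose price lies strictly between MIN_PRICE and MAX_PRICE."""
    in_range = (sales[price_col] > MIN_PRICE) & (sales[price_col] < MAX_PRICE)
    return sales.loc[in_range].reset_index(drop=True)


# Tiny made-up sample, only to show the behaviour at the boundaries.
sample = pd.DataFrame({"price": [95_000, 250_000, 999_999, 1_000_000, 2_300_000]})
filtered = filter_price_range(sample)
```

Both bounds are exclusive here, matching the "more than" and "less than" wording; the notebooks may treat the boundaries differently.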
Additionally, based on the business understanding and industry knowledge, we made several assumptions about the data in order to qualify the predictive accuracy of the model, including:
- We assumed that homes marked as renovated by King County were total renovations - not just partial (e.g., kitchen, bathroom, etc.).
- We assumed that the potential client would know the location of their home.
Given the categories of data listed above, we selected certain factors within the dataset, and created or found new features that would help to create a more accurate pricing model. The new features were:
- Defining top school districts
- Creating a "season sold" factor
- Reshaping categorical values from continuous variables (e.g., grade/condition)
- Combining features, like ratio of basement space to living space
- Creating a user input function
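Two of the features listed above can be sketched as small helper functions. This is an illustration only: the function names are hypothetical, and the month-to-season mapping (March to May as spring, and so on) is an assumption rather than the project's documented definition.

```python
from datetime import date


def season_sold(sale_date: date) -> str:
    """Map a sale date to a season label (assumed month boundaries)."""
    month = sale_date.month
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    if month in (9, 10, 11):
        return "fall"
    return "winter"  # December, January, February


def basement_ratio(sqft_basement: float, sqft_living: float) -> float:
    """Ratio of basement space to living space."""
    if sqft_living <= 0:
        raise ValueError("sqft_living must be positive")
    return sqft_basement / sqft_living


example_season = season_sold(date(2014, 7, 4))
example_ratio = basement_ratio(500, 2000)
```

With a pandas DataFrame, helpers like these could be applied row by row or column by column to build the new feature columns.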
We used Ordinary Least Squares (OLS) regression to create a model that would help determine the most impactful factors and help us predict prices more accurately.
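To make the OLS step concrete, here is a minimal sketch on synthetic data. It uses plain numpy so the arithmetic is visible; the project itself lists statsmodels, which provides an OLS class for the same purpose. The feature names, coefficients, and noise level below are made up for illustration.

```python
import numpy as np


def fit_ols(X, y):
    """Fit y = b0 + X @ b by least squares; return (coefficients, R-squared)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coefs, 1.0 - ss_res / ss_tot


# Synthetic data: price depends linearly on size and a 0/1 waterfront flag.
rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(800, 4000, n)
waterfront = rng.integers(0, 2, n)
price = 50_000 + 150 * sqft + 480_000 * waterfront + rng.normal(0, 20_000, n)

coefs, r2 = fit_ols(np.column_stack([sqft, waterfront]), price)
```

Here `coefs[0]` is the intercept and the remaining entries are the per-feature effects, which is how a fitted model of this kind can rank the most impactful factors.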
In order for this model to work the most effectively, we evaluated it for linearity between factors, and multicollinearity between pairs of factors, which gave us the following:
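One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below is an illustration on synthetic data, not the project's diagnostic output; the feature names are hypothetical and the VIF is computed by hand with numpy. statsmodels offers a comparable `variance_inflation_factor` helper in `statsmodels.stats.outliers_influence`.

```python
import numpy as np


def vif(X):
    """Return one VIF per column: 1 / (1 - R^2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + other columns
        coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coefs
        ss_res = float(resid @ resid)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        factors.append(1.0 / (1.0 - r2))
    return factors


# Synthetic features: sqft_above is almost a rescaled copy of sqft_living,
# while bedrooms is drawn independently.
rng = np.random.default_rng(1)
sqft_living = rng.uniform(800, 4000, 300)
sqft_above = 0.9 * sqft_living + rng.normal(0, 50, 300)
bedrooms = rng.integers(1, 6, 300)

vifs = vif(np.column_stack([sqft_living, sqft_above, bedrooms]))
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign that a feature is largely explained by the others; the two square-footage columns here land far above that, while `bedrooms` stays near 1.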
After iterating through, adjusting, and evaluating 3 regression models, we adopted the final one, using the highest R-squared value as the measure of performance. With more time, we could have further refined the model to account for what appears to be an exponential pattern in the residuals plot:
Given the model we produced, we were able to create functions that would take the user inputs and output a prediction. In this example, the potential client would input the numbers after the colon:
The most predictive factors in home price were:
Price would increase by 180k USD for a home in a top 5 school district, by 210k USD for a home outside the city, and by 480k USD for a home on the waterfront.
Basement square footage is worth less than square footage above ground, and too many bedrooms could lower the value of the house.
Homes with high grades (good architecture and build quality) are worth significantly more than homes with lower grades, and homes in very good condition have a relatively significant price advantage over homes in just average condition.
Homes sold in spring/summer sold for significantly more than those sold in fall/winter - presumably because of that gloomy Seattle rain :-P
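To show how the location effects quoted above combine in a linear model, here is a small worked sketch. The three dollar amounts come from the findings; the baseline price, the width of the range, and the function name are hypothetical, and the sketch assumes the effects are purely additive with no interactions.

```python
# Location effects quoted in the findings above, in USD.
LOCATION_EFFECTS = {
    "top_school_district": 180_000,
    "outside_city": 210_000,
    "waterfront": 480_000,
}


def estimate_price_range(baseline, flags, margin_usd=50_000):
    """Return (low, estimate, high) for a hypothetical baseline and location flags.

    baseline   -- assumed price of a comparable home with none of the flags set
    flags      -- dict mapping effect name -> True/False
    margin_usd -- assumed half-width of the reported range
    """
    estimate = baseline + sum(
        LOCATION_EFFECTS[name] for name, is_set in flags.items() if is_set
    )
    return estimate - margin_usd, estimate, estimate + margin_usd


low, estimate, high = estimate_price_range(
    250_000,
    {"top_school_district": True, "outside_city": True, "waterfront": False},
)
```

In this example the estimate is 250,000 + 180,000 + 210,000 = 640,000 USD, reported as a range of 590,000 to 690,000 USD.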
With less of a time constraint, we would:
- Fix normality issues in model
- Fix heteroskedasticity issues in model
- Fix multicollinearity issues in model
- Test effectiveness of model using test data
- Explore industry and create new features
- Deploy consumer product
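For the "test data" item in the list above, one possible approach is a simple hold-out split: fit on part of the data and compare R-squared on the held-out rows. This is a sketch under assumptions (synthetic data, a 25% test fraction, plain numpy least squares), not a description of how the project would necessarily do it.

```python
import numpy as np


def train_test_r2(X, y, test_fraction=0.25, seed=0):
    """Fit OLS on a random training split; return (train R^2, test R^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))          # shuffle row indices
    n_test = int(len(y) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]

    def design(rows):
        return np.column_stack([np.ones(len(rows)), X[rows]])

    coefs, *_ = np.linalg.lstsq(design(train_idx), y[train_idx], rcond=None)

    def r_squared(rows):
        pred = design(rows) @ coefs
        ss_res = float(((y[rows] - pred) ** 2).sum())
        ss_tot = float(((y[rows] - y[rows].mean()) ** 2).sum())
        return 1.0 - ss_res / ss_tot

    return r_squared(train_idx), r_squared(test_idx)


# Synthetic data with a single made-up feature.
rng = np.random.default_rng(42)
sqft = rng.uniform(800, 4000, (400, 1))
price = 100_000 + 200 * sqft[:, 0] + rng.normal(0, 40_000, 400)

train_r2, test_r2 = train_test_r2(sqft, price)
```

A test R-squared far below the training value would suggest the model is overfitting the training rows; similar values suggest it generalizes within this data.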
Given additional resources, we would recommend the following to the firm:
- Collect additional relevant data around other factors, using expertise
- Open up to home prices above 1 million
- Provide more time-relevant data
Contributors: Alexander Newton, Jim Fay
Languages : Python
Tools/IDE : Git, Powershell (Windows), Anaconda, Jupyter Notebook, Google Slides
Libraries : numpy, pandas, matplotlib, seaborn, statsmodels, scipy, geopandas, descartes, shapely
Duration : July 2020
Last Update : 07.10.2020