- 📁 Data
- 📓 Description
- ✔️ Objective
- 🧱 Tech stack
- 💹 Algorithms
- 📊 Visualization
- 🤓 Conclusions
- ©️ Credits
The data is available on Kaggle at the following link:
https://www.kaggle.com/c/boston-housing
And at its official source: http://lib.stat.cmu.edu/datasets/boston
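Loading it looks something like the sketch below; the file name (`train.csv`) and the lowercase column names (in particular the `medv` target) are assumptions about the Kaggle download, so adjust them to whatever the files actually contain:

```python
import pandas as pd

# Assumed file and column names from the Kaggle download; adjust as needed
df = pd.read_csv("train.csv")

# Split the features from the target (median home value, in $1000s)
X = df.drop(columns=["medv"])
y = df["medv"]

print(X.shape, y.shape)
```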
This is a quick take on a common dataset used for benchmarking algorithms. I've written extensive notebooks where I explain my thought process and what I learned about the dataset; this is NOT one of them. This is a quick pass at solving the dataset.
Sorry if you were looking for that, but my California Housing Pricing project does go into further detail and is also based on a housing dataset.
To develop an algorithm that achieves as low a Mean Squared Error (MSE) on the test data as possible. A simple (not to say easy) objective makes it easier to stay focused on the project.
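For reference, this is how the metric can be computed with scikit-learn, here on made-up numbers rather than real predictions:

```python
from sklearn.metrics import mean_squared_error

# Toy values: actual vs. predicted median home prices (in $1000s)
y_true = [24.0, 21.6, 34.7, 33.4]
y_pred = [25.1, 20.9, 36.0, 30.2]

mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:.3f}")  # mean of the squared residuals
```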
Python, that's it! R is a programming language that, for the time being, I have no experience with; it's powerful and broadly used, but I'd dare to say no more so than Python.
And one of Python's strongest points, if not the strongest, is its libraries, so... the libraries I've used are listed below (a quick import sketch follows the list):
- Pandas, for easy data manipulation and exploratory data analysis.
- NumPy, a really strong linear algebra library, used in this project for its statistics utilities. SciPy may be an alternative, but I have no experience with it at all.
- Matplotlib and Seaborn, both fantastic data visualization libraries, and they complement each other.
- Scikit-Learn, the library used for the machine learning and statistical models: Linear Regression, SVR, Lasso, Ridge, etc.
- TensorFlow and Keras, the industry standard for deep learning. The way to go? Not really; it's just that for now I don't have that much experience with PyTorch.
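In practice that stack boils down to an import block along these lines (just a sketch of typical imports, not the exact notebook cell):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import tensorflow as tf
from tensorflow import keras
```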
I've basically used Linear Regression, XGBoost, and Random Forest, all of them powerful algorithms (a quick comparison sketch follows the list).
- Linear Regression with Polynomial Features is the simplest of the three, but it fits the problem, although it does not perform as well as the other two.
- Random Forest seemed to perform the "best", except not really: it did not get the highest score on the test data, but it's still a damn good algorithm.
- XGBoost, the winner this round, a powerful algorithm that's been taking the prize lately, and for good reason.
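Roughly, the three models can be compared like this; the hyperparameters, train/test split, and file/column names are assumptions for illustration, not the exact settings from the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

df = pd.read_csv("train.csv")  # assumed file name, as above
X, y = df.drop(columns=["medv"]), df["medv"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Polynomial Linear Regression": make_pipeline(
        PolynomialFeatures(degree=2), StandardScaler(), LinearRegression()
    ),
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=300, learning_rate=0.05, random_state=42),
}

# Fit each model and report its MSE on the held-out test split
for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {mse:.3f}")
```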
There's hardly any visualization, only some correlation and distribution plots and, the actual main dish, the regression plot, which is borrowed from Prasad Perera's notebook.
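For reference, a minimal sketch of the kinds of plots involved; the column names (`rm`, `medv`) are assumptions about the Kaggle file, and the regression plot here is just an illustrative feature-vs-target plot, not Prasad Perera's actual one:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("train.csv")  # assumed file name, as above

# Correlation heatmap across all columns
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature correlations")
plt.show()

# Distribution of the target variable
sns.histplot(df["medv"], kde=True)
plt.title("Distribution of medv")
plt.show()

# Simple regression plot of a strong predictor (rooms per dwelling) vs. the target
sns.regplot(x="rm", y="medv", data=df, line_kws={"color": "red"})
plt.show()
```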
Supervised or unsupervised, what truly makes a good model is not the algorithm itself, but good, quality data. Obviously a good algorithm will give a better score, but an adequate algorithm with quality data will outperform an incredibly good algorithm without it.
Quality data is what makes up most of a data science project. In this case I did not have to worry about the data, since it was all cleaned and ready to work with from the start.
All of the credit for the dataset goes to Kaggle, a company owned by Google, and, above all, to the U.S. Census Service, who collected the data itself. Credit also to Prasad Perera, from whose notebook I learned new visualization tricks that proved very useful.