Using pandas and data visualisation libraries (Matplotlib, Seaborn), let's draw conclusions from a dataset.
Contributors:
- Kai Yung (Adam): https://github.com/kaiyungtan
- Joffrey Bienvenu: https://github.com/Joffreybvn
- Mathieu Leers: https://github.com/leersmathieu
Here is a quick introduction to our work: a map of all Belgian municipalities with their median price/m².
The mission: clean the dataset created during the previous challenge and produce a complete analysis and interpretation of it, in order to later build a machine learning model that predicts prices of Belgian real estate sales.
- Using Pandas for data manipulation.
- Using Matplotlib and/or Seaborn for plotting.
- Finding and understanding correlations between the dataset's variables.
To complete this challenge, we used a dataset of 50,000 real estate observations, scraped by Kai Yung and his team during a previous challenge.
To get geographical information about our data, we merged this dataset with the Postal Codes dataset from belgium.be and, later, with the Belgium Municipalities GeoJSON provided by ArcGIS.
We chose to use this dataset because:
- It has a lot of entries, and they are not totally clean! One of the key exercises of this challenge was to work with incomplete/corrupted data while keeping the maximum amount of data, so as to discover interesting correlations and produce a meaningful analysis.
- It was scraped from Immoweb, probably the biggest real estate website in Belgium, which has more detailed data than Zimmo, for example.
- It was scraped by colleagues: the dataset "dataset_house_apartment.csv" was scraped by a BeCode colleague and can be found in this repository.
The dataset has 52,077 rows and 20 columns.
You can download the raw CSV directly from this link.
The CSV has the following columns:
- locality str: Postal code of city.
- type_of_property str: House or Apartment.
- subtype_of_property str: Bungalow, Villa, ...
- price float: Price (€) of the property.
- type_of_sale str: The type of sale.
- number_of_rooms int: The number of rooms of the property.
- house_area int: The area (m2) of the house (floors).
- fully_equipped_kitchen bool: If the house has a fully equipped kitchen or not.
- furnished bool: If the house is furnished or not.
- open_fire bool: If there's an open fire or not.
- terrace bool: If there's a terrace or not.
- terrace_area float: The area (m2) of the terrace.
- garden bool: If there's a garden or not.
- garden_area float: The area (m2) of the garden.
- surface_of_the_land float: The area (m2) of the land.
- surface_of_the_plot_of_land float: The area (m2) of the land's plot.
- number_of_facades int: The number of facades (0 to 4).
- swimming_pool bool: If there's a swimming pool or not.
- state_of_the_building str: Good, As new, To be renovated, ...
- construction_year int: The year the property was built.
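As a quick illustration, here is a minimal sketch (not the authors' exact notebook code) of how the raw CSV can be loaded and inspected with pandas; the file name comes from the repository, everything else is a plausible default:

```python
import pandas as pd

# Load the raw scraped dataset (file name taken from the repository).
df = pd.read_csv("dataset_house_apartment.csv")

# Basic sanity checks: size, column types and missing values.
print(df.shape)          # expected: (52077, 20)
print(df.dtypes)
print(df.isna().sum())   # number of missing values per column
```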
Our goal is to clean and do a complete analysis and interpretation of the dataset. As we are manipulating real estate data, we decided to add geographical data to our dataset.
Based on the request, we decided to search for additional information about the location of the dataframe's observations.
- This information could help us create maps and visualize the data geographically with Folium, a much better way to understand the real estate situation in Belgium!
- We used Folium because it allows us to visualize data on a Leaflet map easily. Folium is built on the data-wrangling strengths of the Python ecosystem and the mapping strengths of the leaflet.js library. A very powerful tool!
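As an illustration of the approach (not the exact notebook code), a Folium choropleth of the median price/m² per municipality could look like the sketch below. The GeoJSON file name, the `key_on` property and the column names are assumptions; the sample values come from the tables later in this document:

```python
import folium
import pandas as pd

# Hypothetical aggregated data: one row per municipality with its median price/m².
price_by_municipality = pd.DataFrame({
    "municipality": ["Knokke", "Leuven", "Ramskapelle"],
    "median_price_sqm": [5500, 4100, 4200],
})

# Map centered on Belgium.
m = folium.Map(location=[50.5, 4.5], zoom_start=8)

folium.Choropleth(
    geo_data="belgium_municipalities.geojson",   # GeoJSON from ArcGIS (assumed file name)
    data=price_by_municipality,
    columns=["municipality", "median_price_sqm"],
    key_on="feature.properties.municipality",    # property name depends on the GeoJSON
    fill_color="YlOrRd",
    legend_name="Median price (€/m²)",
).add_to(m)

m.save("belgium_median_price_map.html")
```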
Merging with a Postal Codes dataset:
From the file code-postaux-belge.csv, which can be obtained from this link, we retrieved the following information:
We dropped the other columns of this dataset (keeping only 4), as they were useless.
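A hedged sketch of how this merge might look (only "dataset_house_apartment.csv", "code-postaux-belge.csv" and the `locality` column come from this document; the four kept columns and their names are assumptions for illustration):

```python
import pandas as pd

df = pd.read_csv("dataset_house_apartment.csv")

# Postal Codes dataset from belgium.be; the column names kept here are hypothetical.
postal = pd.read_csv("code-postaux-belge.csv")
postal = postal[["postal_code", "municipality", "latitude", "longitude"]]

# Attach the geographical information to each observation through its postal code.
df = df.merge(postal, how="left", left_on="locality", right_on="postal_code")
```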
We also needed to associate the postal codes with their provinces and regions. There was no dataset with this information, so we created it from this Wikipedia page.
Postal code correspondence to Region/Province:
- 1000–1299: Région de Bruxelles-Capitale
- 1300–1499: Province of Brabant wallon
- 1500–1999: Province of Brabant flamand (arrondissement of Hal-Vilvorde, except Overijse)
- 2000–2999: Province of Anvers
- 3000–3499: Province of Brabant flamand (arrondissement of Louvain, plus Overijse)
- 3500–3999: Province of Limbourg
- 4000–4999: Province of Liège
- 5000–5999: Province of Namur
- 6000–6599: Province of Hainaut (1)
- 6600–6999: Province of Luxembourg
- 7000–7999: Province of Hainaut (2)
- 8000–8999: Province of Flandre-Occidentale
- 9000–9999: Province of Flandre-Orientale
A function convert_by_postal_code() was created to map each postal code to its associated province and region. You can see it in the Step 1-Cleaning notebook or below:
provinces = {
'Bruxelles-Capitale': {
'range': [(1000, 1299)],
'region': 'Bruxelles'
},
'Brabant wallon': {
'range': [(1300, 1499)],
'region': 'Wallonie'
},
'Brabant flamand': {
'range': [(1500, 1999), (3000, 3499)],
'region': 'Flandre'
},
'Anvers': {
'range': [(2000, 2999)],
'region': 'Flandre'
},
'Limbourg': {
'range': [(3500, 3999)],
'region': 'Flandre'
},
'Liège': {
'range': [(4000, 4999)],
'region': 'Wallonie'
},
'Namur': {
'range': [(5000, 5999)],
'region': 'Wallonie'
},
'Hainaut': {
'range': [(6000, 6599), (7000, 7999)],
'region': 'Wallonie'
},
'Luxembourg': {
'range': [(6600, 6999)],
'region': 'Wallonie'
},
'Flandre-Occidentale': {
'range': [(8000, 8999)],
'region': 'Flandre'
},
'Flandre-Orientale': {
'range': [(9000, 9999)],
'region': 'Flandre'
}
}
def convert_by_postal_code():
    """
    Convert the 'provinces' dictionary into a dictionary with postal codes as keys,
    and their region and province as values.
    """
    postal_dict = {}
    # Loop through each entry in the "provinces" dictionary
    for province, content in provinces.items():
        # Loop through each postal code of each range
        for postal_code_tuple in content['range']:
            for code in range(postal_code_tuple[0], postal_code_tuple[1] + 1):
                # Store the province and region for each postal code.
                postal_dict[code] = {'province': province, 'region': content['region']}
    return postal_dict
postal_dict = convert_by_postal_code()
# Then, the result is merged with our dataframe.
# Get the details on: https://github.com/kaiyungtan/challenge-data-analysis/blob/master/step1_cleaning.ipynb
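A hedged sketch of that merge (not the notebook's exact code; only `locality` and `postal_dict` come from this document, the new column names are illustrative):

```python
# postal_dict has integer postal codes as keys, so align the dtype first.
codes = df["locality"].astype(int)

# Attach the province and region to every observation.
df["province"] = codes.map(lambda c: postal_dict.get(c, {}).get("province"))
df["region"] = codes.map(lambda c: postal_dict.get(c, {}).get("region"))
```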
The cleaning phase is very important: without good cleaning, our analysis could be badly influenced by outliers. So we decided to identify what we would need during the analysis phase, drop the useless columns and clean the useful ones.
To proceed with the analysis, we needed a clean dataset containing at least:
- Prices, postal code and city names.
- A price/m2 column
It's good to have a lot of columns, as more columns can reveal more correlations. However, it's bad to keep columns with errors or with incorrect, missing or absurd values. That's why we divided our cleaning into two phases:
A first cleaning pass over the raw data, focused on dropping the obvious problems:
- Dropping the duplicated rows
- Dropping columns with only one unique value
- Checking each column's properties
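A minimal sketch of this first pass (assuming the dataframe is named `df`; not the notebook's exact code):

```python
# Drop exact duplicates.
df = df.drop_duplicates()

# Drop columns that carry no information (only one unique value).
single_valued = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
df = df.drop(columns=single_valued)

# Check each column's properties (type, range, missing values).
print(df.describe(include="all"))
print(df.isna().sum())
```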
Some tweaks were then made to remove outliers and to drop columns that were useless due to their high rate of None values. This step required deeper investigation into the data:
- Dropping "terrace_area" column
It has more than 30% of None. - Dropping "garden_area" column
It has more than 50% of None. - Dropping "subtype" column
Lots of property subtype. Some with less than 100 entries, in a dataset of 50.000. This column was not relevant. - Removing the "Apartment blocks" entries
Apartment blocks are a whole building. It's not the kind of real estate sales we want here. - Changing None to "unknow"
We also refactored all float to int. At the end of the cleaning, we merged our dataframe with the two other ones created during the request study.
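A hedged sketch of these tweaks (column and value names are taken from the dataset description above where possible, the rest is illustrative):

```python
# Remove the "apartment block" entries before dropping the subtype column
# (the exact label depends on the scraped data).
df = df[df["subtype_of_property"].str.lower() != "apartment block"]

# Drop the columns with too many missing values or too little relevance.
df = df.drop(columns=["terrace_area", "garden_area", "subtype_of_property"])

# Replace the remaining missing categorical values by "unknown".
df["state_of_the_building"] = df["state_of_the_building"].fillna("unknown")

# Convert float columns that hold whole numbers to integers
# (assuming their missing values were handled beforehand).
df["number_of_rooms"] = df["number_of_rooms"].astype(int)
df["house_area"] = df["house_area"].astype(int)
```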
At the end of the cleaning phase, we had 40,395 rows (observations) and 18 columns. We also made a Pandas Profiling report to check the cleanliness of the data.
This is where the fun starts! 🥳
Three people know more than one.
To get the most out of our data, and to allow each of us to get experience manipulating Pandas and Seaborn, we decided to work separately. Later we reviewed our work and merged the results.
The price is obviously the target of this challenge, as our goal is to create a machine learning model to predict prices of Belgian real estate sales.
To identify the correlations, we used a heatmap of the correlation matrix:
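A minimal sketch of how such a heatmap can be produced with Seaborn (categorical columns are label-encoded first; not the exact notebook code):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Label-encode the non-numeric columns so they can enter the correlation matrix.
encoded = df.copy()
for col in encoded.select_dtypes(include="object").columns:
    encoded[col] = encoded[col].astype("category").cat.codes

# Plot the correlation matrix as a heatmap.
plt.figure(figsize=(12, 10))
sns.heatmap(encoded.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```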
- The Price is mainly correlated with the Number of rooms and the House area.
- The Number of rooms and the House area seem strongly correlated with each other.
- The Type of property is the variable with the most correlations to the other variables.
There are many other observations that we made and investigated. They are detailed in each of our step 2 notebooks.
Based on these observations, we came to the following conclusion:
The Open fire, the Garden, the Location of the house (municipality) and its Number of facades determine the Type of property, which in turn greatly influences the Number of rooms and the House area: an apartment will have less space and fewer rooms than a house.
The Number of rooms and the House area are two variables based on the size of the property, and they are the main influence on the Price: a larger house/apartment is more expensive than a smaller one.
For this challenge, we had to answer the following questions. To do so, we first created a new column, price per square meter (price / house area).
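A hedged sketch of how this column and the rankings below might be computed (the column `price_per_sqm` and the grouping column `city_name` are illustrative names, not the notebook's):

```python
# Price per square meter for each observation.
df["price_per_sqm"] = df["price"] / df["house_area"]

# Aggregate per municipality: average price, median and average price/m²
# ("city_name" stands for whatever municipality column the merged data provides).
by_city = df.groupby("city_name").agg(
    average_price=("price", "mean"),
    median_price_sqm=("price_per_sqm", "median"),
    average_price_sqm=("price_per_sqm", "mean"),
)

# Most expensive municipalities, ranked on the average price/m².
print(by_city.sort_values("average_price_sqm", ascending=False).head(3))
```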
What are the most expensive municipalities in Belgium? (Average price, median price, price per square meter)
The ranking below is based on the average price/m²:
Belgium | Average price (€) | Median price (€/m²) | Average price (€/m²) |
---|---|---|---|
Knokke | 472 000 | 5500 | 5500 |
Leuven | 365 000 | 4100 | 4700 |
Ramskapelle | 356 000 | 4200 | 4400 |
What are the most expensive municipalities in Wallonia? (Average price, median price, price per square meter)
The ranking below is based on the average price/m²:
Wallonia | Average price (€) | Median price (€/m²) | Average price (€/m²) |
---|---|---|---|
Louvain-La-Neuve | 465 000 | 3800 | 3750 |
Thines | 550 000 | 3400 | 3440 |
Ottignies | 380 000 | 3400 | 3160 |
What are the most expensive municipalities in Flanders? (Average price, median price, price per square meter)
The ranking below is based on the average price/m²:
Flanders | Average price (€) | Median price (€/m²) | Average price (€/m²) |
---|---|---|---|
Knokke | 472 000 | 5500 | 5500 |
Leuven | 365 000 | 4100 | 4700 |
Ramskapelle | 356 000 | 4200 | 4400 |
What are the least expensive municipalities in Belgium? (Average price, median price, price per square meter)
The ranking below is based on the average price/m²:
Belgium | Average price (€) | Median price (€/m²) | Average price (€/m²) |
---|---|---|---|
Beauwelz | 70 000 | 350 | 350 |
Focant | 80 000 | 390 | 390 |
Nollevaux | 139 000 | 420 | 420 |
What are the least expensive municipalities in Flanders? (Average price, median price, price per square meter)
The ranking below is based on the average price/m²:
Flanders | Average price (€) | Median price (€/m²) | Average price (€/m²) |
---|---|---|---|
Bossuit | 220 000 | 700 | 700 |
Elverdinge | 290 000 | 730 | 730 |
Wijtschate | 100 000 | 830 | 830 |
Brussels | Average price (€) | Median price (€) | Average price (€/m²) |
---|---|---|---|
Auderghem | 429326 | 392500 | 4191 |
Molenbeek-Saint-Jean | 234724 | 219000 | 2288 |
Wallonia | Average price (€) | Median price (€) | Average price (€/m²) |
---|---|---|---|
La Hulpe | 346000 | 332500 | 3898 |
Villers-Sur-Semois | 14500 | 14500 | 517 |
Flanders | Average price (€) | Median price (€) | Average price (€/m²) |
---|---|---|---|
Knokke | 550494 | 515000 | 6363 |
Kermt | 229500 | 229500 | 1213 |
Brussels | Average price (€) | Median price (€) | Average price (€/m²) |
---|---|---|---|
Watermael-Boitsfort | 637965 | 595000 | 3426 |
Koekelberg | 377500 | 330000 | 1725 |
Wallonia | Average price (€) | Median price (€) | Average price (€/m²) |
---|---|---|---|
Louvain-La-Neuve | 595200 | 580000 | 3159 |
Beauwelz | 70000 | 70000 | 350 |
Flanders | Average price (€) | Median price (€) | Average price (€/m²) |
---|---|---|---|
Boutersem | 443245 | 360000 | 3750 |
Bossuit | 220000 | 220000 | 698 |
- The first difficulty was to find a method of collaborative work suited to our wishes. We chose to work via Google Colab pages and to gather the validated cells in this GitHub repo.
- The second "big" difficulty was to quickly learn how to use tools such as Matplotlib or Seaborn (for data visualisation). We solved it by working on our own and sharing our work with each other.
- Another challenge was the fair distribution of work. We do not all have the same ease with statistics and programming, so everyone did their best and we then merged our results.
- We encountered a small problem with file names, because we do not all work in the same environment (Windows/Ubuntu...).
- One of the difficulties with graphs was that if we take all the values without at least filtering them, the visualization is not very relevant.