Challenge: Data Analysis

Using pandas and data visualisation libraries (Matplotlib, Seaborn), let's establish conclusions about a dataset.

Team member:

Kai Yung (Adam)

https://github.com/kaiyungtan
Joffrey Bienvenu

https://github.com/Joffreybvn
Mathieu Leers

https://github.com/leersmathieu

Price/m2 Map:

Here is a quick introduction to our work: A map of all belgian's municipalities with their median price/m2. Click to enlarge

Challenge's summary

The mission is: Cleaning and doing a complete analysis and interpretation of the dataset created during the previous challenge. In order to create a machine learning model to predict prices on Belgium's real estate's sales.

Objectives:

Using Pandas for data manipulation.
Using MatplotLib and/or Seaborn for plotting.
Finding and understanding correlations between dataset's variables.

The dataset: a 50.000 entries' one !

To complete this challenge, we used a Dataset of 50.000 real estate's observations, previously scrapped by Kai Yung and its team during a previous challenge.

To get geographical informations about our data, we merged this dataset with the Postal Codes dataset from belgium.be and later, witht the Belgium Municipalities GeoJSON provided by ArcGis.

Why this dataset ?

We choose to use this dataset because:

It has a lot of entries: And they are not totally clean ! One of the key exercice of this challenge, was to be able to work with uncomplete/corrupted data. And at the same time, having the maximum amount of data to discover interesting correlations, and have a meaningfull analyse.
It was scrapped from ImmoWeb: Immoweb is probably the biggest real estate website of Belgium. It has more detailed data than Zimmo, for example.
It was scrapped by colleagues: The dataset "dataset_house_apartment.csv" was scrapped by a Becode colleague and can be found on this repository.

In this dataset we have 52.077 rows and 20 columns.

CSV architecture

You can download the raw CSV directly from this link.

Show the CSV architecture

locality str: Postal code of city.
type_of_property str: House or Apartment.
subtype_of_property str: Bungalow, Villa, ...
price float: Price (€) of the property.
type_of_sale str The type of sale.
number_of_rooms int: The number of rooms of the property.
house_area int: The area (m2) of the house (floors).
fully_equipped_kitchen bool: If the house has a fully equipped kitchen or not.
furnished bool: If the house is furnished or not.
open_firebool: If there's an open fire or not.
terrace bool: If there's a terrace or not.
terrace_area float: The area (m2) of the terrace.
garden bool: If there's a garden or not.
garden_area float: The area (m2) of the garden.
surface_of_the_land float: The area (m2) of the land.
surface_of_the_plot_of_land float: The area (m2) of the land's plot.
number_of_facadess int: The number of facades (0 to 4).
swimming_pool bool: If there's a swimming pool or not.
state_of_the_building str: Good, As new, To be renovated, ...
construction_year int: The property built's year.

The Challenge

1. Request study & technical challenge

Our goal is to clean and do a complete analysis and interpretation of the dataset. As we are manipulating real estate data, we decided to add geographical data to our dataset.

Adding geographical data

Based on the request, we decided to search for additional informations about the location of the dataframe's observations.

Why ?

These information could help us to create maps, and visualize the data geographically with Folium
A much better way to understand the real estate situation in Belgium !
We used Folium because it allow to visualize data on a Leaflet map easily.
Folium is build on the data wrangling strengths of the Python ecosystem and the mapping strengths of the leaflet.js library. A very powerfull tool !

Merging with a Postal Codes dataset:

From the file code-postaux-belge.csv, which can be obtained from this link, we retrieved the following informations:

column_1 postal_code int
column_2 city_name str
column_3 longitude float
column_4 lattitude float

We dropped the other columns of this dataset (to keep only 4), as they were useless.

Adding the Provinces and Regions

We also needed to associate the postal code with their provinces and regions. They were no dataset with these informations. So we created it from this wikipedia page.

Postal codes correspondence to Region/Province

1000–1299 : Région de Bruxelles-Capitale
1300–1499 : Province du Brabant wallon
1500–1999 : Province du Brabant flamand (arrondissement de Hal-Vilvorde, sauf Overijse)
2000–2999 : Province d'Anvers
3000–3499 : Province du Brabant flamand (arrondissement de Louvain, plus Overijse)
3500–3999 : Province de Limbourg
4000–4999 : Province de Liège
5000–5999 : Province de Namur
6000–6599 : Province de Hainaut (1)
6600–6999 : Province de Luxembourg
7000–7999 : Province de Hainaut (2)
8000–8999 : Province de Flandre-Occidentale
9000–9999 : Province de Flandre-Orientale

A function "convert_by_postal_code()" was created to map related postal code to their associated provinces and regions. You can see it in the Step 1-Cleaning notebook or below:

Expand this function

provinces = {
    'Bruxelles-Capitale': {
        'range': [(1000, 1299)],
        'region': 'Bruxelles'
    },
    'Brabant wallon': {
        'range': [(1300, 1499)],
        'region': 'Wallonie'
    },
    'Brabant flamand': {
        'range': [(1500, 1999), (3000, 3499)],
        'region': 'Flandre'
    },
    'Anvers': {
        'range': [(2000, 2999)],
        'region': 'Flandre'
    },
    'Limbourg': {
        'range': [(3500, 3999)],
        'region': 'Flandre'
    },
    'Liège': {
        'range': [(4000, 4999)],
        'region': 'Wallonie'
    },
    'Namur': {
        'range': [(5000, 5999)],
        'region': 'Wallonie'
    },
    'Hainaut': {
        'range': [(6000, 6599), (7000, 7999)],
        'region': 'Wallonie'
    },
    'Luxembourg': {
        'range': [(6600, 6999)],
        'region': 'Wallonie'
    },
    'Flandre-Occidentale': {
        'range': [(8000, 8999)],
        'region': 'Flandre'
    },
    'Flandre-Orientale': {
        'range': [(9000, 9999)],
        'region': 'Flandre'
    }
}

def convert_by_postal_code():
  """
  Convert the 'provinces' dictionnary to a dictionnary of postal codes as key,
  and its region and province as values.
  """
  postal_dict = {}

  # Loop through each entry in the "provinces" dictionnary
  for province, content in provinces.items():

    # Loop though each postal code
    for postal_code_tuple in content['range']:
      for code in range(postal_code_tuple[0], postal_code_tuple[1] + 1):

        # Append the province and region for each postal code.
        postal_dict[code] = {'province': province, 'region': content['region']}

  return postal_dict

postal_dict = convert_by_postal_code()

# Then, the result is merged with our dataframe.
# Get the details on: https://github.com/kaiyungtan/challenge-data-analysis/blob/master/step1_cleaning.ipynb

2. Data Cleaning

The cleaning phase is very important: without a good cleaning, our analysis could be badly influenced by outliers. So, we decided to identify our neer during the analysis phase, to drop the useless column and clean the usefull ones.

Identifing the needs:

To proceed to the analysis, we needed a clean dataset containing at least:

Prices, postal code and city names.
A price/m2 column

Removing the outliers (error, incorrect or absurd).

It's good to have a lot of columns, as it can create more correlations between them. However, it's bad to have columns with errors, incorrect, missing or absurd values. That's why we divided our cleaning in two phases:

Cleaning the raw:

A very first clean to the raw data. We were focused on "dropping the big lies":

Dropping the duplicated rows
Dropping columns with only one unique value
Checking each columns's properties

Refining the values

Some tweaks were made on the dataset to remove outliers and useless columns, due to their high rate of None value. This steap required deeper investigation intop the data:

Show the details

Dropping "terrace_area" column
It has more than 30% of None.
Dropping "garden_area" column
It has more than 50% of None.
Dropping "subtype" column
Lots of property subtype. Some with less than 100 entries, in a dataset of 50.000. This column was not relevant.
Removing the "Apartment blocks" entries
Apartment blocks are a whole building. It's not the kind of real estate sales we want here.
Changing None to "unknow"

We also refactored all float to int. At the end of the cleaning, we merged our dataframe with the two other ones created during the request study.

How many rows and columns ?

At the end of the cleaning phase, we had 40.395 rows (observations) and 18 columns. We also made a Pandas Profiling Report to prove the clea

3. Data Analysis & Interpretation

This is where the fun start ! 🥳

Three people know more than one.

To get the most out of our data, and to allow each of us to get experience manipulating Pandas and Seaborn, we decided to work separately. Later we reviewed our work and merged the results.

Our target: The Price.

The price is obviously the target of this challenge, as ou goal is to to create a machine learning model to predict prices on Belgium's sales.

Correlation between variables ?

To identify the correlation, we used this heatmap:

Observations:

The Price is mainly correlated with the Number of rooms and the House area.
The Number of rooms and House area seems mainly correlated with each other.
The Type of property is the variables which has the most correlation with other variables.

They are many other observations we did and we investigated. These are detailed in the step 2 notebooks of each of us.

Conclusions:

Based on theses observations, we came to the following conclusion:

The Open fire, the Garden, the Location of the house (municipality) and its Number of facades determine the Type of property. Which influence greatly on the Number of rooms and the House area: An apartement will have less space and less rooms than a house.

Number of rooms and house area are two variables based on the size of the property. And they are the main influence on the Price: A larger house/apartment is more expensive than a smaller house/apartment.