The objective of this project is to use data provided by Airbnb to predict the destination that a new customer is most likely to choose for their first booking, given information about their social background. Such predictions help to personalize the Airbnb experience, shorten booking search times, increase efficiency, and forecast overall demand across locations. The data is publicly available on Kaggle as part of a competition hosted by Airbnb. It is valuable because it shows which countries people with similar backgrounds tend to visit for their first booking. The analysis can also help Airbnb distribute housing more accurately according to demand, which should lead to higher customer satisfaction.
Follow these instructions to run the code on your local machine.
- Python 2.7
- Jupyter Notebook
- numpy: 0.7.0
- pandas: 0.20.3
- matplotlib: 2.1.0
- seaborn: 0.8.0
- ipywidgets: 7.0.0
- xgboost
- RandomizedLogisticRegression from sklearn.linear_model
- RandomForestClassifier from sklearn.ensemble
Dataset Resource: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data
Note: Download the data and put the files in the folder /input.
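As a quick check that the data landed in the right place, the files can be loaded with pandas. This is only a sketch: the file names below are assumed from the Kaggle competition page and are not stated in this README, and `load_input_tables` and its `data_dir` argument are hypothetical helpers, not part of the project notebooks.

```python
import os
import pandas as pd

# Assumed file names (taken from the Kaggle competition page, not from this
# README); edit the list if your download looks different.
EXPECTED_FILES = [
    "train_users_2.csv",
    "test_users.csv",
    "sessions.csv",
    "countries.csv",
    "age_gender_bkts.csv",
]


def load_input_tables(data_dir="input"):
    """Load every expected CSV found in data_dir.

    Returns a dict of DataFrames keyed by file name without the extension,
    plus a list of the expected files that were not found.
    """
    tables = {}
    missing = []
    for name in EXPECTED_FILES:
        path = os.path.join(data_dir, name)
        if os.path.exists(path):
            key = os.path.splitext(name)[0]
            tables[key] = pd.read_csv(path)
        else:
            missing.append(name)
    return tables, missing


if __name__ == "__main__":
    tables, missing = load_input_tables("input")
    for key, frame in tables.items():
        print(key, frame.shape)
    if missing:
        print("Not found:", ", ".join(missing))
```

The function reports missing files instead of raising an error, so a partial download still loads whatever is available.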
Please download the code from https://github.com/pengzai6666/Airbnb-New-User-Bookings.git
Run Team8_Data_Engineering.ipynb for the data engineering part (visualization).
If you are also interested in the prediction part, we provide the preprocessing code at Team8_Preprocessing.ipynb.
This project covers both data engineering (80%) and destination prediction (20%).
Here is the basic workflow of the data engineering part:
- Visualization of Population in Destination Countries
- Visualization of Gender- and Destination-Related Features
- Country destination booking ratio of [destination] vs [All other countries excluding Not Defined]
- Comparison of different country bookings with a set age threshold
- Visualization of Affiliate-Related Features
- Data Analysis of Affiliate-Related Features
- Time Interval Analysis (the time interval is the time a user spends on Airbnb)
- Action Type Analysis (the action type is the action recorded for a user in their activity log)
- Device Type Analysis (the device type is the device a user used to perform an action)
- Action Type Analysis regarding time elapsed
- Destination Distribution with respect to the frequency of a given action type
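One of the steps above, the booking ratio of a destination versus all other countries excluding Not Defined, can be illustrated with a short pandas sketch. It assumes the user table has `gender` and `country_destination` columns and that the Not Defined category is encoded as `NDF`; the helper `destination_ratio` and the sample rows are made up for illustration and are not taken from the notebooks.

```python
import pandas as pd


def destination_ratio(users, destination, group_col="gender"):
    """Share of bookings that go to `destination` within each group.

    Users without a defined destination (assumed to be coded "NDF")
    are dropped before the ratio is computed.
    """
    booked = users[users["country_destination"] != "NDF"]
    is_destination = booked["country_destination"] == destination
    # The mean of a boolean series is the fraction of True values.
    return is_destination.groupby(booked[group_col]).mean()


# Made-up sample rows, only to show the call.
users = pd.DataFrame({
    "gender": ["FEMALE", "FEMALE", "MALE", "MALE", "MALE", "FEMALE"],
    "country_destination": ["US", "FR", "US", "NDF", "US", "NDF"],
})
ratios = destination_ratio(users, "US")
print(ratios)
```

Passing a different `group_col` (for example an age bucket column, if one has been derived) gives the same ratio for other groupings.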
Here is the basic workflow of the prediction part:
- Preprocessing the dataset: the code can be found in Team8_Preprocessing.ipynb
- Feature Selection: the functions we use are in Team8_Feature_Selection.ipynb
- Predictive Model: we rely on code published by other Kaggle users, available at the following links:
  - https://www.kaggle.com/svpons/script-0-8655
  - https://github.com/svegapons/kaggle_airbnb
  - https://github.com/davidgasquez/kaggle-airbnb/blob/master/notebooks/User%20Data.ipynb
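The three steps above (preprocessing, feature selection, predictive model) can be sketched end to end on synthetic data. This is not the code from the notebooks or from the linked Kaggle scripts: the feature selection here ranks features by random forest importance instead of using RandomizedLogisticRegression, the model is a RandomForestClassifier rather than xgboost, and the column names, function names, and generated rows are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def preprocess(users):
    """Fill missing ages with -1 and one-hot encode categorical columns."""
    features = users.drop("country_destination", axis=1)
    features = features.assign(age=features["age"].fillna(-1))
    return pd.get_dummies(features)


def select_features(X, y, n_keep=3, seed=0):
    """Keep the n_keep columns with the highest random forest importance."""
    forest = RandomForestClassifier(n_estimators=50, random_state=seed)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1][:n_keep]
    return list(X.columns[order])


def top_k_destinations(model, X, k=2):
    """Return the k most probable classes per row, most probable first."""
    proba = model.predict_proba(X)
    best = np.argsort(proba, axis=1)[:, ::-1][:, :k]
    return model.classes_[best]


# Synthetic users: the destination depends only on age, by construction.
rng = np.random.RandomState(0)
n = 200
users = pd.DataFrame({
    "age": rng.randint(18, 70, size=n).astype(float),
    "gender": rng.choice(["FEMALE", "MALE"], size=n),
    "signup_method": rng.choice(["basic", "facebook"], size=n),
})
users["country_destination"] = np.where(users["age"] > 40, "FR", "US")

X = preprocess(users)
y = users["country_destination"]
kept = select_features(X, y, n_keep=3)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X[kept], y)
ranked = top_k_destinations(model, X[kept], k=2)
print(kept)
print(ranked[:5])
```

Returning a ranked list of destinations per user, rather than a single label, keeps the output usable when several candidate countries are wanted for each new user.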
If anything goes wrong while testing the code, please contact s7qin@eng.ucsd.edu directly. Thanks.