Skip to content

git-masoud/airbnb-data-preparaton-task

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data engineer task challenge

Task Description

As part of our initiative to transform our project into a booking platform, we will be offering hotels and apartments through our platform. Imagine 2 years from now, that we have been hugely successful in offering apartments (like AirBnb) on our website. In order to offer the best price, we want to come up with a price prediction model, which uses the historical data of our booking platform.

  • Task
    • You get from us a data sample which contains historical listings from AirBnb. However, in its current state, it is not suitable to be used right away by our data scientists who want to focus on the model building without having to modify the data much further. Your task is now the following: Overall Goal: Prepare the provided dataset for deriving a price prediction model.
  • Sub Tasks:
    • Document thoroughly your approach for manipulating the dataset.
    • Document how you intend your code to interface with the source system and the data scientists' models and how the data could be continuously updated (make assumptions where necessary).
    • Document how your code could be scaled if confronted with terabytes of data. Alternatively, implement mechanisms that allow seamless scaling.
  • Bonus Point:
    • Use the processed data to derive a price prediction model.

Submission

Price prediction data pipeline

Installation and Execution

Getting Started

These instructions will make you two pipelines in airflow which they would run on your local machine.

Prerequisites

Java Version 8

Scala Version 2.11.9

sbt

Airflow

Spark

csvtool sudo apt-get install csvtool

Execution

Clone the project from: git clone https://github.com/git-masoud/airbnb-data-preparaton-task.git

Navigate to the project root and run config bash script:

bash extra/bash/config.sh 
export AIRFLOW_DAG_PATH=[Path of airflow]/dags 

This will download the data, configure airflow, run airflow webserver and airflow scheduler. After that by opening airflow UI you should be able to find airbnb_data_preparation_demo and airbnb_Ml_pipeline_demo. You can trigger them and the rest will be done by the pipeline.

5. Project Structure

── airflow
    └── airbnb_pipeline.py
    └── airbnb_model_pipeline.py
── data
    └── airbnb
── extra
    └── deequ-1.0.2.jar
    └── testproject.jar
── src
    └── main
        └── scala
            └── task
                └── common
                    └── Constants
                    └── DataQaulity
                    └── SparkUtils
                └── main
                    └── DataIngestor 
                    └── ListingsModelDataPreparation 
                    └── ListingsTablePreprocessor 
                    └── MLListingsModel
    └── test
        └── scala
            └── task
                └── common
                    └── SparkUtilsTests
── build.sbt
── .gitignore
── config.sh
── ReadMe.md

TODOs

  • Implementing the ml part properly
  • All the parts should be more dynamic. No hard code!
  • Write more test case
  • dockerising the project
  • Add night mode ;)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published