Skip to content

Latest commit

 

History

History
41 lines (33 loc) · 1.61 KB

README.md

File metadata and controls

41 lines (33 loc) · 1.61 KB

MusicBoxRetentionAnalysis

The reading sequence is 0_create_data_folders.sh
1_download_data.ipynb
2_unpack_and_clean_files.sh
3_etl_down_sample_by_user.ipynb
4_EDA_with_spark.ipynb
5_feature_engineer_with_spark.ipynb
6_train_model_sklearn.ipynb

Summary of data analysis
Part zero: create_data_folders.sh
■ Create data folder for data downloading

Part one: download_data.ipynb
■ Down load data from AWS

Part two: unpack_and_clean_files.sh
■ Un-compress the downloaded data and combine them to each category.

Part three: etl_down_sample_by_user.ipynb
■ Removed the robots users from data set based on unusual music play frequency.
■ Applied down sampling method to cut data size in half.
■ Introduced the data structure and content.

Part four: EDA_with_spark.ipynb
Explored the following questions:
■ What's the user activity from 04/01 to 05/12?
■ What's the device distribution?
■ What's the percentage of paid songs?
■ Why do users stop playing songs before it ends?
■ What's the weekly retention rate for users?

Part five: feature_engineer_with_spark.ipynb
■ Created and engineered features for retention analysis.

Part six: train_model_sklearn.ipynb
■ Based on generated feature, built logistic regression, random forest, GBDT and neural network to predict monthly retention behavior.
■ Compared the performance of each model through cross-validation.
■ Usd grid-search to fine tuning random forest model due to its good performance.