Team Members: Shengzhao LEI, Tao Sun, Xiangzhe Meng
Project Data Story Link: Amazon review data analysis
The Applied Data Analysis course teaches the basic techniques and practical skills needed to make sense of a variety of data, using widely adopted data science tools such as pandas, scikit-learn, and Spark.
This course covers the fundamental steps of the data science pipeline:
- Data Acquisition
Variety as one of the main challenges in big data: structured, semi-structured, unstructured; Data sources: open, public (scraping, parsing, and down-sampling); Dataset fusion, filtering, slicing & dicing; Data granularities and aggregations
- Data Wrangling
Data manipulation, array programming, dataframes; The many sources of data problems (and how to fix them): missing data, incorrect data, inconsistent representations; Schema alignment, data reconciliation; Data quality testing with crowdsourcing
- Data Interpretation
Distribution fitting, statistical significance; Co-occurrence grouping (market-basket analysis); Machine learning in practice (supervised and unsupervised, feature engineering, more data vs. advanced algorithms, curse of dimensionality, etc.); Text mining: vector space model, topic models, word embedding; Social network analysis (influencers, community detection, etc.)
- Data Visualization
- Reporting
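
As a small illustration of the co-occurrence grouping (market-basket analysis) topic listed under Data Interpretation, the sketch below counts how often pairs of items appear together. It is not taken from the course material or from this project: the `baskets` toy data and the helper names `pair_counts` and `support` are hypothetical, and only the Python standard library is assumed.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def support(pair, baskets):
    """Fraction of baskets that contain both items of the pair."""
    counts = pair_counts(baskets)
    return counts[tuple(sorted(pair))] / len(baskets)


# Hypothetical toy data: each inner list is one "basket" of items.
baskets = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["book", "pen"],
    ["lamp"],
]

counts = pair_counts(baskets)
for pair, n in counts.most_common():
    print(pair, n, support(pair, baskets))
```

Pairs with high support are candidates for "frequently bought together" style groupings; on real data one would typically also filter by a minimum support threshold, which is left out here for brevity.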