Youtube Data Analysis

Big data analysis of Youtubers, based on increases and decreases in subscriber counts and on the comments posted during those increase and decrease periods.

Python YoutubeAPI License: AGPL v3

Websites Used

Motivation

Many people dream of becoming a famous Youtuber these days. The content of a channel certainly influences its fame. However, the comments that Youtube users write also influence a channel's popularity. Therefore, this project analyzes the top 30 channels with increasing popularity and the top 30 with decreasing popularity, and compares the comments posted during the increase and decrease periods.

Category

Since comments may vary drastically depending on a Youtuber's category, the Youtube channels are split into six categories:

  • Autos & Vehicles
  • Entertainment
  • Gaming
  • How to & Style
  • Science & Technology
  • Travel & Events

Conclusion

In the experiment, ratios and z-scores are calculated from 60 channels for each positive/negative status in each of the six categories. The test likewise uses 30 channels for each positive/negative status in each of the six categories.

Test results

| Present ~ | | Autos & Vehicles | Entertainment | Gaming | How to & Style | Science & Technology | Travel & Events |
|---|---|---|---|---|---|---|---|
| Duplicate | Ratio | 53.85% | 72.73% | 56.86% | 50.91% | 52.00% | 52.83% |
| Duplicate | Z-Score | 57.69% | 72.73% | 56.86% | 50.91% | 52.00% | 52.83% |
| No Duplicate | Ratio | 53.85% | 59.09% | 56.86% | 50.91% | 52.00% | 52.83% |
| No Duplicate | Z-Score | 55.77% | 59.09% | 56.86% | 50.91% | 54.00% | 52.83% |

| Present ~ 1 Week Ago | | Autos & Vehicles | Entertainment | Gaming | How to & Style | Science & Technology | Travel & Events |
|---|---|---|---|---|---|---|---|
| Duplicate | Ratio | 45.45% | 59.09% | 52.94% | 45.45% | 48.00% | 50.94% |
| Duplicate | Z-Score | 48.08% | 59.09% | 52.94% | 47.27% | 50.00% | 50.94% |
| No Duplicate | Ratio | 48.08% | 59.09% | 52.94% | 45.45% | 48.00% | 50.94% |
| No Duplicate | Z-Score | 48.08% | 59.09% | 52.94% | 47.27% | 50.00% | 50.94% |

| 1 Week Ago ~ 2 Weeks Ago | | Autos & Vehicles | Entertainment | Gaming | How to & Style | Science & Technology | Travel & Events |
|---|---|---|---|---|---|---|---|
| Duplicate | Ratio | 21.15% | 13.64% | 15.69% | 21.82% | 32.00% | 39.62% |
| Duplicate | Z-Score | 19.23% | 22.73% | 11.76% | 20.00% | 34.00% | 39.62% |
| No Duplicate | Ratio | 21.15% | 15.91% | 15.69% | 21.82% | 32.00% | 39.62% |
| No Duplicate | Z-Score | 19.23% | 22.73% | 11.76% | 20.00% | 34.00% | 39.62% |

How it Works

🛑 Watch Out

To run this code, you must get a Youtube API key from the Google Developer console and set it as the API_KEY environment variable.
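
As a minimal sketch of how a script might pick up that key, the helper below reads API_KEY from the environment and fails early with a readable message when it is missing. The function name `load_api_key` is hypothetical and is not taken from this repository.

```python
import os


def load_api_key(env=os.environ):
    """Return the Youtube API key stored in the API_KEY environment variable.

    `load_api_key` is a hypothetical helper for illustration only.
    """
    key = env.get("API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "Set the API_KEY environment variable to your Youtube API key."
        )
    return key
```

On Linux or macOS the variable can be set for the current shell with `export API_KEY="your-key"` before running the scripts.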

0️⃣ Dependency Installation

pip install -r requirements.txt

1️⃣ Web Scrape

Scrape statistics for the top 30 increasing and top 30 decreasing channels in each category.

python web_scrape.py

2️⃣ Youtube API

Query a maximum of the 5 most recent videos and get 100 comments and the statistics.
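
The comment query could look roughly like the sketch below. It assumes a YouTube Data API v3 client object, for example one created with `googleapiclient.discovery.build("youtube", "v3", developerKey=key)`, and calls `commentThreads().list`, which returns at most 100 threads per page. The function name `fetch_recent_comments` is hypothetical, and `api_query.py` may structure its requests differently.

```python
def fetch_recent_comments(youtube, video_id, max_comments=100):
    """Return up to `max_comments` top-level comment texts for one video.

    `youtube` is assumed to be a YouTube Data API v3 client; this helper
    is illustrative and not the repository's own code.
    """
    response = (
        youtube.commentThreads()
        .list(
            part="snippet",
            videoId=video_id,
            maxResults=min(max_comments, 100),  # one page holds at most 100
            order="time",  # newest comments first
            textFormat="plainText",
        )
        .execute()
    )
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response.get("items", [])
    ]
```

Passing the client in as an argument keeps the function easy to try out with a stub object, without spending any API quota.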

🛑 Watch Out

By default, the Youtube API allows only 10,000 quota units per day. If you do not have additional quota, you must modify the code so it gets data for a single category at a time.

python api_query.py
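
To see why one category per day is a sensible limit, here is a rough quota estimate. It is only a sketch under stated assumptions: the per-request costs follow the YouTube Data API quota table (100 units for `search.list`, 1 unit each for `videos.list` and `commentThreads.list`), the call pattern per channel is a guess rather than what `api_query.py` necessarily does, and 60 channels per category assumes 30 increasing plus 30 decreasing channels from one scrape.

```python
# Assumed quota costs per request (YouTube Data API quota table).
SEARCH_LIST_COST = 100          # search.list
VIDEOS_LIST_COST = 1            # videos.list
COMMENT_THREADS_LIST_COST = 1   # commentThreads.list
DAILY_QUOTA = 10_000


def estimate_units(channels, videos_per_channel=5):
    """Estimate quota units for `channels` channels under the assumed calls."""
    per_channel = (
        SEARCH_LIST_COST                                  # find recent videos
        + VIDEOS_LIST_COST                                # one batched statistics call
        + videos_per_channel * COMMENT_THREADS_LIST_COST  # one comment page per video
    )
    return channels * per_channel


one_category = estimate_units(60)        # assumed 30 increasing + 30 decreasing
all_categories = estimate_units(60 * 6)  # all six categories in one run
print(one_category, one_category <= DAILY_QUOTA)      # 6360 True
print(all_categories, all_categories <= DAILY_QUOTA)  # 38160 False
```

Under these assumptions a single category fits in the daily quota, while all six categories together do not.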

3️⃣ Preprocess

Preprocess the data with the following steps:

  • Tokenize comments
  • Remove punctuation
  • Keep English only
  • Remove stopwords
  • Extract word stem
  • Count words
  • Calculate ratio
  • Calculate z-score

python preprocess_data.py
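
The steps above can be sketched with the standard library alone. Everything here is illustrative: the stopword list is a tiny placeholder, `crude_stem` is a rough stand-in for a real stemmer, and the ratio and z-score definitions (share of all counted words, and standardized count) are one plausible reading that may differ from what `preprocess_data.py` computes.

```python
import re
import string
from collections import Counter
from statistics import mean, pstdev

# Tiny placeholder stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "this", "that", "to", "of", "i", "it"}


def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(comment):
    """Tokenize, strip punctuation, keep English-looking words, drop stopwords, stem."""
    tokens = comment.lower().split()                             # tokenize on whitespace
    tokens = [t.strip(string.punctuation) for t in tokens]      # remove edge punctuation
    tokens = [t for t in tokens if re.fullmatch(r"[a-z]+", t)]  # keep ASCII letters only
    tokens = [t for t in tokens if t not in STOPWORDS]          # remove stopwords
    return [crude_stem(t) for t in tokens]                      # extract word stems


def count_words(comments):
    """Count stemmed words across a list of comments."""
    counts = Counter()
    for comment in comments:
        counts.update(preprocess(comment))
    return counts


def ratios(counts):
    """Share of each word among all counted words (assumed definition)."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def z_scores(counts):
    """Standardized count of each word (assumed definition)."""
    if not counts:
        return {}
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    return {w: (c - mu) / sigma if sigma else 0.0 for w, c in counts.items()}
```

For example, `count_words(["Great video", "great editing"])` counts `great` twice and `video` and `edit` once each, so `ratios` gives `great` a share of 0.5.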

5️⃣ Visualization

Visualize the data with the following:

  • Wordcloud
  • Horizontal Bar Graph
  • Vertical Bar Graph

python visualize.py
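
As an illustration of the horizontal bar graph, the sketch below selects the most frequent words and plots them with matplotlib. It assumes word counts are available as a mapping from word to count; the function names are hypothetical, and `visualize.py` may use different libraries or styling.

```python
from collections import Counter


def top_words(counts, n=10):
    """Return the n most frequent (word, count) pairs, most frequent first."""
    return Counter(counts).most_common(n)


def plot_horizontal_bar(counts, path, n=10):
    """Save a horizontal bar graph of the n most frequent words to `path`."""
    import matplotlib  # imported lazily so top_words works without matplotlib

    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    pairs = top_words(counts, n)
    words = [word for word, _ in pairs]
    values = [count for _, count in pairs]

    fig, ax = plt.subplots()
    ax.barh(words, values)
    ax.invert_yaxis()  # put the most frequent word at the top
    ax.set_xlabel("Count")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
```

Swapping `ax.barh` for `ax.bar` (and dropping `invert_yaxis`) gives the vertical variant of the same chart.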

6️⃣ Test

Test the accuracy of calculated ratio and z-score.

🛑 Watch Out

You must redo step 1️⃣ and step 2️⃣ first to collect data for testing.

python test.py
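
One way such an accuracy figure could be computed is sketched below. The prediction rule is purely an assumption for illustration: each word carries a signed score (for instance, derived from the ratios or z-scores of the positive and negative channel groups), a channel is predicted "positive" when its words sum to more than zero, and accuracy is the share of channels predicted correctly. `test.py` may use a different rule.

```python
def predict_status(words, word_scores):
    """Predict "positive" when the summed word scores are above zero.

    `word_scores` is a hypothetical mapping from word to signed score.
    """
    total = sum(word_scores.get(word, 0.0) for word in words)
    return "positive" if total > 0 else "negative"


def accuracy(channels, word_scores):
    """Fraction of (words, actual_status) pairs predicted correctly."""
    if not channels:
        return 0.0
    correct = sum(
        predict_status(words, word_scores) == actual
        for words, actual in channels
    )
    return correct / len(channels)
```

With this rule, always guessing one status on a balanced test set would score 50%, which gives a baseline for reading the percentages in the test results.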