YouTube Trending Video Analytics with AWS

Overview

This project aims to securely manage, streamline, and analyze the structured and semi-structured YouTube video data based on the video categories and trending metrics.

It uses the AWS cloud computing platform to build an end-to-end data pipeline for processing and storage, and connects to Microsoft Power BI as the BI tool for building reports.

Dataset: Trending YouTube Video Statistics

This Kaggle dataset contains daily statistics (CSV files) on popular YouTube videos over the course of many months. Up to 200 trending videos are listed per day for each of several regions, and the data for each region is in its own file. The data includes the video title, channel title, publication time, tags, views, likes and dislikes, description, and comment count. It also includes a category_id field, which differs by region; the categories are described in the JSON file associated with each region.

Find out more about the dataset: Trending YouTube Video Statistics

Tools and Services Used

Amazon S3: An object storage service that provides industry-leading scalability, data availability, security, and performance.

It is used as the staging layer for our raw data
Data Lake for raw, cleansed, and analytical data files

AWS Glue: A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics.

Used as the Data integration tool for our project.
Discover and build our raw, cleansed, and analytical metadata.
Create a Catalog for our data lake so we can easily query our data using AWS Athena.

AWS Athena: An interactive query service for S3. There is no need to load the data elsewhere; it stays in S3.

Query our data from the data lake
Put a structure over our data lake to query our data easily and in an efficient way.

AWS Lambda: A compute service that lets programmers run code without creating or managing servers.

Extract and transform our data files after they land in the raw data area.
Automate the ETL process using Lambda triggers.
Configure it to load the data from the raw area, clean it, and move it to the cleansed data area in our data lake.

Microsoft Power BI: Serves as our BI tool for building reports.

Connect to AWS Athena to load the data from our analytical storage area and build our reports.
Provide a secure connection and automated reporting to support the decision-making process.

AWS IAM: Identity and Access Management, which lets us manage access to AWS services and resources securely.

Define different roles for different users.
Manage data governance and access control for the different layers in our AWS S3 storage.

Data Infrastructure and Architecture

Source Systems and Data Acquisition

Build a bash script that downloads the data through the Kaggle API and pushes it to the raw area of the AWS S3 data lake using the AWS CLI. File: dataTransfer.sh
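The repository's script is written in bash and is not reproduced here. As a rough Python sketch of the same two steps, the helpers below only build the command lines for the Kaggle CLI and the AWS CLI. The dataset slug, the bucket name, the key prefix, and the idea of deriving the region from the first two letters of the file name are all assumptions made for illustration.

```python
# Sketch only: builds the CLI commands, does not run them.
# Assumptions: dataset slug, bucket name, and prefix are placeholders,
# and data files are named like "CAvideos.csv" (region code first).

def download_command(dataset, target_dir):
    """Kaggle CLI command that downloads and unzips a dataset."""
    return ["kaggle", "datasets", "download", "-d", dataset,
            "-p", target_dir, "--unzip"]


def upload_command(local_path, bucket, prefix):
    """AWS CLI command that copies one file into a region=<code>/ partition."""
    file_name = local_path.rsplit("/", 1)[-1]
    region = file_name[:2].lower()  # "CAvideos.csv" -> "ca"
    destination = f"s3://{bucket}/{prefix}region={region}/"
    return ["aws", "s3", "cp", local_path, destination]


download = download_command("datasnaek/youtube-new", "data")
upload = upload_command("data/CAvideos.csv", "example-bucket",
                        "youtube/raw_statistics/")
print(" ".join(download))
print(" ".join(upload))
```

Each list can be passed to subprocess.run(command, check=True) once the Kaggle and AWS credentials are configured.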

Datalake Architecture

Naming Conventions

project/[raw, curated]/region/environment type [dev, production]/data partitioning <source/type/region>

Example: data_engineering_youtube_analytics/raw/us-east-1/dev/youtube/raw_statistics/region=ca/
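To make the convention concrete, here is a small Python sketch that assembles a prefix from its parts and reproduces the example path. The function name, its parameters, and the environment check are illustrative assumptions, not code from the repository.

```python
# Hypothetical helper illustrating the naming convention.
VALID_ENVIRONMENTS = {"dev", "production"}


def build_prefix(project, layer, aws_region, environment,
                 source, data_type, region_code):
    """Return the S3 key prefix for one partition of the data lake."""
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return (
        f"{project}/{layer}/{aws_region}/{environment}/"
        f"{source}/{data_type}/region={region_code.lower()}/"
    )


prefix = build_prefix(
    "data_engineering_youtube_analytics", "raw", "us-east-1", "dev",
    "youtube", "raw_statistics", "CA",
)
print(prefix)
```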

Layers

Raw Area

Staging area to store the data coming from different sources in raw format.

Cleansed Area

Store the cleansed data after processing with AWS Lambda.

Analytics Area (Curated Area)

Store analytics-ready data to be queried directly with Athena and exported to Microsoft Power BI.

Data Processing Layer

Process the data on arrival in the raw data area using AWS Lambda.

An AWS Lambda function automates the cleaning process: a trigger invokes the function whenever a new file arrives in the raw area, and the function forwards the cleaned output to the next layer.

Lambda function file: youtube_analytics_json_to_parquet.py
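The Lambda source is not shown here, so the following is only a sketch of the two pieces of logic such a function typically needs: reading the bucket and key out of the S3 event, and flattening the nested category JSON into flat rows. The function names, the sample bucket and key, and the assumed JSON layout (an "items" list with "id" and "snippet" fields) are assumptions. Writing the rows out as Parquet is left out.

```python
import json
from urllib.parse import unquote_plus


def object_location(event):
    """Return (bucket, key) for the first record of an S3 event notification."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in the event (e.g. "=" as "%3D").
    key = unquote_plus(record["object"]["key"])
    return bucket, key


def flatten_categories(raw_json):
    """Turn the nested category JSON into one flat dict per category."""
    document = json.loads(raw_json)
    rows = []
    for item in document.get("items", []):
        snippet = item.get("snippet", {})
        rows.append({
            "id": int(item["id"]),
            "title": snippet.get("title"),
            "assignable": snippet.get("assignable"),
        })
    return rows


# Hypothetical inputs for illustration.
sample_event = {"Records": [{"s3": {
    "bucket": {"name": "example-bucket"},
    "object": {"key": "youtube/raw_statistics/region%3Dca/CA_category_id.json"},
}}]}
sample_json = json.dumps({"items": [
    {"id": "1", "snippet": {"title": "Film & Animation", "assignable": True}},
]})

print(object_location(sample_event))
print(flatten_categories(sample_json))
```

In a real handler, the object would be read from S3 between these two steps and the flattened rows written to the cleansed area.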

Build a Data Catalog above the cleansed Data to query it using Athena

Use AWS Glue to build a schema layer over the cleansed data and store it in the cleansed database so it can be queried with Athena.
Build a PySpark job to:
1. Drop Nulls
2. Change Data types
3. Filter data based on the region
4. Schema Mapping
5. Save the output to the cleansed database

File: youtube_analytics_cleansed_etl_csv_to_parquet.py
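The job itself runs on PySpark in AWS Glue. To show the row-level logic of steps 1 to 4 without a Spark cluster, here is a plain-Python sketch using only the standard library. It is not the repository's code: the column subset, the target types, and the list of regions are assumptions, and step 5 (saving to the cleansed database) is omitted.

```python
import csv
import io

# Hypothetical target schema: column name -> type converter.
SCHEMA = {"video_id": str, "category_id": int, "views": int,
          "likes": int, "region": str}


def cleanse(rows, allowed_regions):
    """Apply region filter, null drop, type casts, and schema mapping."""
    cleaned = []
    for row in rows:
        # 3. Filter data based on the region.
        if row.get("region") not in allowed_regions:
            continue
        # 1. Drop rows with a missing value in any mapped column.
        if any(row.get(column) in (None, "") for column in SCHEMA):
            continue
        # 2 + 4. Cast each value and keep only the mapped columns.
        cleaned.append({column: cast(row[column])
                        for column, cast in SCHEMA.items()})
    return cleaned


sample_csv = """video_id,category_id,views,likes,region
a1,10,100,5,ca
b2,,50,1,ca
c3,24,70,2,jp
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
result = cleanse(rows, allowed_regions={"ca", "gb", "us"})
print(result)
```

Only the first sample row survives: the second has an empty category_id and the third belongs to a region outside the allowed set.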

Join Different Tables to Build the Analytical Table

Build a PySpark job that builds the analytical table and loads it into the analytical database:
1. Join Tables
2. Apply schema mapping
3. Build analytical data and load it into an analytical database

File: etl_final_analytics_table.py
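As an illustration of the join step only, the sketch below performs an inner join between statistics rows and category rows in plain Python. It is not the PySpark job from the repository. Joining on category_id alone is an assumption; since category ids can differ by region, a real job may need to include the region in the join key. All field names are hypothetical.

```python
def build_analytics_rows(statistics, categories):
    """Inner-join statistics rows to category rows on category_id."""
    # Index the reference table by id for constant-time lookups.
    titles = {category["id"]: category["title"] for category in categories}
    joined = []
    for stat in statistics:
        title = titles.get(stat["category_id"])
        if title is None:
            continue  # inner join: skip rows without a matching category
        joined.append({**stat, "category_title": title})
    return joined


# Hypothetical sample rows.
statistics = [
    {"video_id": "a1", "category_id": 10, "views": 100, "region": "ca"},
    {"video_id": "z9", "category_id": 99, "views": 7, "region": "ca"},
]
categories = [{"id": 10, "title": "Music"}]

analytics = build_analytics_rows(statistics, categories)
print(analytics)
```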

BI Analytics and Reporting Layer

Query and explore the data with AWS Athena, then export it to Microsoft Power BI. File: athena_sql_queries.sql
Build an initial report using Microsoft Power BI:

Summarization of Trending Content Categories based on each country
Views Analytics
Likes, Dislikes, and Comments Analytics
