Big Data Analytics module (UEL-CN-7031), featuring Hive and PySpark analysis on the UNSW-NB15 dataset, with detailed tasks, scripts, visualizations, and reports

Kyeyuneashiraf/big-data-analytics-project

Big Data Analytics Project

This repository contains the code and documentation for the Big Data Analytics module (UEL-CN-7031) project.

Project Structure

  • data/: Contains the dataset files.
  • notebooks/: Jupyter notebooks for each task.
  • scripts/: Shell and Python scripts for data processing and analysis.
  • reports/: Final report and presentation files.
  • visuals/: Visualizations and plots generated during the analysis.
  • docs/: Additional documentation.

Tasks

  1. Understanding Dataset
  2. Big Data Query & Analysis by Apache Hive
  3. Advanced Analytics using PySpark
  4. Individual Assessment

Setup

  1. Clone the repository:

    git clone https://github.com/Kyeyuneashiraf/big-data-analytics-project.git
    cd big-data-analytics-project
  2. Follow the instructions in the notebooks/ directory to execute the tasks.

Usage

  • Run the shell script to load data into HDFS:

    ./scripts/load_data_to_hdfs.sh
  • Execute the Hive queries:

    hive -f scripts/hive_queries.sql
  • Run the PySpark analysis:

    spark-submit scripts/pyspark_analytics.py
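
The three commands above can also be chained from a single driver. The following is a minimal sketch, not part of the repository: the names `PIPELINE_STEPS` and `run_pipeline` are hypothetical, and it assumes the commands are meant to run in the order listed, from the repository root. By default it only prints the commands (dry run), so it can be tried without a Hadoop, Hive, or Spark installation.

```python
import shlex
import subprocess
from typing import List, Sequence

# The three commands listed under Usage, in the order the README gives them.
# Running them in this order is an assumption of this sketch.
PIPELINE_STEPS: List[List[str]] = [
    ["./scripts/load_data_to_hdfs.sh"],
    ["hive", "-f", "scripts/hive_queries.sql"],
    ["spark-submit", "scripts/pyspark_analytics.py"],
]


def run_pipeline(
    steps: Sequence[Sequence[str]] = PIPELINE_STEPS,
    dry_run: bool = True,
) -> List[str]:
    """Handle each step in order and return the command lines that were handled.

    With dry_run=True nothing is executed; the commands are only printed.
    With dry_run=False a non-zero exit code raises CalledProcessError,
    so later steps do not run after an earlier one has failed.
    """
    handled = []
    for step in steps:
        command_line = shlex.join(step)
        print(command_line)
        if not dry_run:
            subprocess.run(list(step), check=True)
        handled.append(command_line)
    return handled


if __name__ == "__main__":
    # Preview the commands; pass dry_run=False to actually execute them.
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` executes the steps for real and stops at the first failure, which avoids running the Hive or PySpark step after a failed load.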

License

This project is licensed under the MIT License.
