This repository contains the code and documentation for the Big Data Analytics module (UEL-CN-7031) project.
data/
: Contains the dataset files.
notebooks/
: Jupyter notebooks for each task.
scripts/
: Shell and Python scripts for data processing and analysis.
reports/
: Final report and presentation files.
visuals/
: Visualizations and plots generated during the analysis.
docs/
: Additional documentation.
- Understanding Dataset
- Big Data Query & Analysis by Apache Hive
- Advanced Analytics using PySpark
- Individual Assessment
- Clone the repository:

      git clone https://github.com/Kyeyuneashiraf/big-data-analytics-project.git
      cd big-data-analytics-project

- Follow the instructions in the notebooks/ directory to execute the tasks.

- Run the shell script to load data into HDFS:

      ./scripts/load_data_to_hdfs.sh

- Execute the Hive queries:

      hive -f scripts/hive_queries.sql

- Run the PySpark analysis:

      spark-submit scripts/pyspark_analytics.py
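
The last three steps can also be chained from a single script. The following is a minimal sketch, not part of the repository: the names `PIPELINE`, `missing_tools`, and `run_pipeline` are hypothetical, and the assumption that `hdfs`, `hive`, and `spark-submit` must all be on `PATH` is a guess based on the commands above. It runs the commands in order from the repository root and stops at the first one that fails.

```python
"""Run the HDFS load, Hive, and PySpark steps in order (illustrative sketch)."""
import shutil
import subprocess

# Commands copied verbatim from the steps above; run from the repository root.
PIPELINE = [
    ["./scripts/load_data_to_hdfs.sh"],
    ["hive", "-f", "scripts/hive_queries.sql"],
    ["spark-submit", "scripts/pyspark_analytics.py"],
]


def missing_tools(tools=("hdfs", "hive", "spark-submit")):
    """Return the entries of `tools` that cannot be found on PATH.

    The default tool list is an assumption, not a documented requirement.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


def run_pipeline(commands=PIPELINE, dry_run=False):
    """Run each command in order and return the list of commands handled.

    With dry_run=True the commands are only printed, not executed.
    Otherwise a non-zero exit status raises subprocess.CalledProcessError,
    so later steps never run on top of a failed one.
    """
    handled = []
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
        handled.append(cmd)
    return handled
```

Calling `missing_tools()` first gives an early, readable error if the Hadoop, Hive, or Spark clients are not installed, and `run_pipeline(dry_run=True)` prints the three commands without touching the cluster.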
This project is licensed under the MIT License.