This project was completed as the team project for the YCBS 257 Data at Scale class in the Professional Development Certificate Program in Data Science and Machine Learning at McGill University. The project introduced MapReduce functions for solving Big Data problems.
From the given flight data, our team used MapReduce to extract the necessary fields and calculate the distances between Beijing and the data points.
This repository has the following main directories and files:
- data: flight data with JSON strings
- output files: outputs from the mapper and reducer programs, and from the process of creating the CSV output file
- images: diagrams created for summary of processes and components
- mapper.ipynb: mapper program in Python
- reducer.ipynb: reducer program in Python
- PPT Slides.pdf: the group presentation slides
The project was completed in the following five steps:
- Step 1. Input Dataset: Verify and review the input dataset (i.e., data type, format, and structure) before processing
- Step 2. Build Mapper: Write a mapper program that keeps only the flight ids that have position messages, along with the clock, ident, latitude, and longitude
- Step 3. Build Reducer: Write a reducer program which takes the last position of the flight and calculates its distance to Beijing
- Step 4. Create CSV List: From the reducer output file, produce a CSV list of all flights (ident, id, and distance to Beijing) sorted from closest to furthest from Beijing
- Step 5. Data Analysis: Analyze the output file with sorted flight distances and summarize the results of the analysis
The mapper program was developed based on the input dataset of 19,404 JSON strings in a text file. The reducer program was developed from the sorted output file of the mapper program, which was also a text file with JSON strings.
The details of the input dataset were as follows:
Our team used the following Haversine formula to calculate the distances between Beijing and the data points in the given dataset, which were given as latitudes and longitudes:
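For reference, the Haversine formula in its standard form is written out here. The symbols are this sketch's own choice and may differ from the project's notation: $\varphi_1, \varphi_2$ are the latitudes and $\lambda_1, \lambda_2$ the longitudes of the two points in radians, $r$ is the Earth's radius, and $d$ is the great-circle distance.

$$
d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)
$$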
The Haversine formula was broken down into three sections and implemented in the reducer program for calculating the distances as follows:
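A minimal sketch of how such a three-part breakdown might look in Python. This is not the project's reducer code: the function name `haversine_km`, the split into these particular three sections, the mean Earth radius of 6371 km, and the Beijing coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0        # assumed mean Earth radius in kilometres
BEIJING = (39.9042, 116.4074)   # assumed (latitude, longitude) of Beijing in degrees


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)

    # Section 1: the haversine term under the square root.
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)

    # Section 2: the central angle between the two points, in radians.
    # min() guards against a rounding slightly above 1 for antipodal points.
    c = 2 * math.asin(min(1.0, math.sqrt(a)))

    # Section 3: scale the angle by the Earth's radius.
    return EARTH_RADIUS_KM * c
```

With these assumptions, `haversine_km(lat, lon, *BEIJING)` gives the distance in kilometres from a flight position to Beijing.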
The mapper program produced the flight data as JSON objects, mapped and sorted as key-value pairs (text file: 'sorted_mapped_flight_data.txt'), and this file was passed to the reducer program as its input. The reducer program produced 9,747 JSON strings in a text file (text file: 'reduced_flight_data.txt'). Each string was reduced to the 'id', 'ident' and 'distance' fields, where 'distance' was calculated from the 'latitude' and 'longitude' data by the Haversine function within the reducer program.
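The map, sort, and reduce flow described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the code in `mapper.ipynb` or `reducer.ipynb`: the `type == "position"` test, the exact input field names, the helper names `mapper`, `reducer`, and `run`, and the Beijing coordinates are all hypothetical.

```python
import json
import math
from itertools import groupby

BEIJING = (39.9042, 116.4074)  # assumed (latitude, longitude) in degrees


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km (radius is an assumed mean value)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def mapper(lines):
    """Yield (flight id, record) pairs for position messages only."""
    for line in lines:
        if not line.strip():
            continue
        msg = json.loads(line)
        if msg.get("type") != "position":  # assumed message-type field
            continue
        yield msg["id"], {
            "clock": int(msg["clock"]),
            "ident": msg["ident"],
            "latitude": float(msg["latitude"]),
            "longitude": float(msg["longitude"]),
        }


def reducer(sorted_pairs):
    """For each flight id, keep the last position and compute its distance."""
    for flight_id, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        last = max((rec for _, rec in group), key=lambda rec: rec["clock"])
        yield {
            "id": flight_id,
            "ident": last["ident"],
            "distance": haversine_km(last["latitude"], last["longitude"], *BEIJING),
        }


def run(lines):
    """Map, sort by key (then clock), and reduce, all in memory."""
    pairs = sorted(mapper(lines), key=lambda kv: (kv[0], kv[1]["clock"]))
    return list(reducer(pairs))
```

Sorting the mapper output by flight id before reducing is what lets `groupby` see all records of one flight together, which mirrors the role of the sorted intermediate file.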
To create the final output, the reducer output text file with JSON objects was transformed into a Pandas dataframe, the flight list was sorted by flight distance in ascending order, and the sorted dataframe was exported to a CSV file (CSV file: 'flight_list_sorted_by_distance.csv').
Additional data analysis was conducted in Jupyter Notebook (Jupyter Notebook file: 'reducer.ipynb'), and our team obtained the following results:
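A minimal sketch of this last step with Pandas, assuming each line of the reducer output is one JSON object with 'id', 'ident' and 'distance' fields. The function name `reduced_to_sorted_frame` and the sample records are hypothetical, and the CSV is written to an in-memory buffer here rather than to the project's file.

```python
import io
import json

import pandas as pd


def reduced_to_sorted_frame(lines):
    """Load JSON-lines reducer output and sort flights by distance, ascending."""
    records = [json.loads(line) for line in lines if line.strip()]
    frame = pd.DataFrame(records, columns=["ident", "id", "distance"])
    return frame.sort_values("distance", ascending=True).reset_index(drop=True)


# Hypothetical sample records standing in for the reducer output file.
sample = [
    '{"id": "F1", "ident": "AAA1", "distance": 2500.0}',
    '{"id": "F2", "ident": "BBB2", "distance": 120.5}',
]

frame = reduced_to_sorted_frame(sample)

# Export without the dataframe index so the CSV holds only the three columns.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

To write a file instead of a buffer, a path can be passed to `to_csv` in place of `buffer`.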