Clustering of Taxi Trip Data using Spark

Dataset

The dataset is the 2015 Yellow Taxi Trip Data. It includes records of all trips completed in yellow taxis in NYC from January to June 2015. Due to limited resources, we used only a 2 GB subset of the dataset. This subset contains 13 million trip records and is available here. It consists of two comma-delimited text files (.csv): the first contains all the necessary information about each trip, and the second contains information about the taxi vendors.

Algorithm

The goal of the project is to find the coordinates of the top 5 pickup locations. To achieve this, we implemented K-means with k=5, which clusters the pickup locations into five regions; the final cluster centers are the reported coordinates.
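The clustering step can be illustrated with a minimal, plain-Python sketch. It is not the code in kmeans.py and does not use Spark; it only mirrors the map (assign each point to its nearest center) and reduce (average the points per center) structure of K-means. The function names, the random initialization, the fixed iteration count, and the use of Euclidean distance on raw longitude/latitude values are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def closest_center(point, centers):
    """Map step: return the index of the center nearest to `point`."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, k=5, iterations=3, seed=0):
    """Plain-Python K-means over (longitude, latitude) pairs."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random initial centers
    for _ in range(iterations):      # assumption: fixed number of iterations
        # Map: key every point by its nearest center.
        # Reduce: accumulate per-center coordinate sums and counts.
        sums = defaultdict(lambda: [0.0, 0.0, 0])
        for lon, lat in points:
            acc = sums[closest_center((lon, lat), centers)]
            acc[0] += lon
            acc[1] += lat
            acc[2] += 1
        # Update: move each center to the mean of its assigned points;
        # a center with no assigned points keeps its previous position.
        centers = [
            (sums[i][0] / sums[i][2], sums[i][1] / sums[i][2]) if i in sums else centers[i]
            for i in range(k)
        ]
    return centers
```

In a Spark job, the assignment would typically be a map over the records and the per-center averaging a reduce by key, so the same two phases run in parallel over the 13 million records.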

Requirements

numpy
simplekml

Usage

We assume that Spark and HDFS are already installed on the system.

Upload the data to HDFS

hadoop fs -put ./yellow_tripdata_1m.csv hdfs://master:9000/yellow_tripdata_1m.csv

Install necessary requirements

pip install -r requirements.txt

Submit the job in a Spark environment

spark-submit kmeans.py

Get the results from HDFS and print them

hadoop fs -getmerge hdfs://master:9000/kmeans.res ./kmeans.res
cat kmeans.res 

[-74.33685886  40.71401562]
[-73.84222159  40.71854692]
[-73.99625567  40.71627292]
[-73.99097947  40.74444285]
[-73.96875398  40.77099479]

Convert them to KML format

python create_kml.py
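As a rough illustration of this conversion, the sketch below parses lines in the kmeans.res format shown above and builds a KML document with one placemark per center. It is not the repository's create_kml.py; that script is not shown here and presumably relies on simplekml from the requirements. To stay dependency-free, this version uses only the Python standard library, and the function names and placemark labels are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_centers(text):
    """Parse lines like '[-73.99625567  40.71627292]' into (lon, lat) floats."""
    centers = []
    for line in text.splitlines():
        fields = line.strip().strip("[]").split()
        if len(fields) == 2:  # skip blank or malformed lines
            centers.append((float(fields[0]), float(fields[1])))
    return centers


def centers_to_kml(centers):
    """Build a KML document string with one Placemark per center."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    for i, (lon, lat) in enumerate(centers, start=1):
        placemark = ET.SubElement(doc, "Placemark")
        ET.SubElement(placemark, "name").text = f"Center {i}"  # hypothetical label
        point = ET.SubElement(placemark, "Point")
        # KML expects coordinates in longitude,latitude order.
        ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")
```

The values in kmeans.res are already in longitude, latitude order, which matches the order KML expects, so no swapping is needed before writing the coordinates.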

Results

Project Structure

kmeans.py: Runs the K-means algorithm using Spark MapReduce jobs.
create_kml.py: Creates a KML file from the output.
kmeans.res: Coordinates of the final centers.
report.pdf: Report of the project, in Greek.
description.pdf: Project description, in Greek.
