
Predicting taxi ride duration in Manhattan from selected features such as day of week and pickup location.


meme2515/taxi_rides


🚕 Taxi Ride Duration Prediction

taxi_ride.ipynb contains the code used to predict the duration of taxi rides in Manhattan, using data published by the NYC Taxi and Limousine Commission. I mainly used the Python libraries pandas and sklearn, and the best results came from tree regression. The analysis proceeds in four parts.

1. Data Selection and Cleaning

After collecting a random sample of the original taxi ride data into a SQLite database (stored in the data folder), I queried only the rides that occurred within the boundaries of New York City. I then filtered out abnormal rides (for example, rides with a negative number of passengers or a duration longer than one hour) and used the point-in-polygon algorithm to keep only rides that took place in Manhattan.
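The point-in-polygon step can be sketched with the classic ray-casting test. This is a minimal illustration, not the notebook's code: the function name, the `rides` records, and the `ROUGH_AREA` quadrilateral are all hypothetical, and the polygon is a deliberately coarse stand-in rather than the real Manhattan boundary.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: cast a horizontal ray to the right of `point` and
    count the polygon edges it crosses; an odd count means inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's y-coordinate can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets the horizontal line at y.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical, very coarse quadrilateral; NOT the real Manhattan boundary.
ROUGH_AREA: List[Point] = [
    (-74.02, 40.70),
    (-73.97, 40.71),
    (-73.91, 40.87),
    (-73.95, 40.88),
]

# Made-up rides for illustration only.
rides = [
    {"id": 1, "pickup": (-73.985, 40.758)},  # falls inside the rough area
    {"id": 2, "pickup": (-73.780, 40.641)},  # well outside, to the southeast
]
kept = [r["id"] for r in rides if point_in_polygon(r["pickup"], ROUGH_AREA)]
```

In practice the same test would presumably be applied to both the pickup and the drop-off coordinates, with a polygon traced from an actual boundary file.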

2. Exploratory Data Analysis 📈

The data is limited to the month of January, and I found that New Year's Day, MLK Day, and a blizzard in 2016 significantly affected the number of taxi users. I analyzed these fluctuations to identify abnormal dates and exclude them from the training data.
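One simple way to surface such dates is to count rides per day and flag days that stray far from the typical count. The sketch below assumes a median-based rule with a 30% tolerance; both the rule and the `abnormal_dates` helper are illustrative assumptions, and the counts are made up rather than taken from the dataset.

```python
from collections import Counter
from datetime import date
from statistics import median
from typing import Iterable, List


def abnormal_dates(pickup_dates: Iterable[date], tolerance: float = 0.3) -> List[date]:
    """Return dates whose ride count deviates from the median daily count
    by more than `tolerance`, expressed as a fraction of the median."""
    counts = Counter(pickup_dates)
    typical = median(counts.values())
    return sorted(
        d for d, n in counts.items() if abs(n - typical) > tolerance * typical
    )


# Made-up daily counts for illustration only; not figures from the dataset.
toy_counts = {
    date(2016, 1, 20): 100,
    date(2016, 1, 21): 104,
    date(2016, 1, 22): 98,
    date(2016, 1, 23): 30,  # sharp drop, the kind a storm day might produce
    date(2016, 1, 24): 95,
}
# Expand the counts into one pickup date per ride, as raw data would look.
pickups = [d for d, n in toy_counts.items() for _ in range(n)]
flagged = abnormal_dates(pickups)
```

Flagged dates would still need a manual look, since a holiday and a data-collection gap can produce the same dip.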

3. Feature Engineering

I selected start location, end location, trip distance, time of day, and day of the week from the original data as features for prediction, and built a feature matrix using the sklearn package. Because it was difficult to use latitude-longitude coordinates directly as a feature, I used PCA to divide the rides into three groups: Lower, Midtown, and Upper Manhattan. The resulting features also indicate the speed of the ride and whether the ride took place on a weekend.
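The README does not say how the PCA output was turned into three groups, so the following is only one plausible reading: project each (longitude, latitude) pair onto the first principal component, which for Manhattan should run roughly along the island's long axis, and cut the scores into thirds. The tercile cut points, the helper names, and the toy coordinates are assumptions. The sketch uses NumPy's SVD in place of sklearn; `sklearn.decomposition.PCA(n_components=1)` would give the same scores up to sign.

```python
import numpy as np


def first_component_scores(coords: np.ndarray) -> np.ndarray:
    """Project (lon, lat) rows onto their first principal component."""
    centered = coords - coords.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # The sign of a principal direction is arbitrary; orient it so that
    # larger scores correspond to larger latitude (further north).
    if direction[1] < 0:
        direction = -direction
    return centered @ direction


def split_into_three(scores: np.ndarray) -> np.ndarray:
    """Label scores 0, 1, or 2 using tercile cut points (an assumption)."""
    cuts = np.quantile(scores, [1 / 3, 2 / 3])
    return np.digitize(scores, cuts)


# Made-up points along a diagonal line; not real pickup locations.
coords = np.array([[-74.01 + 0.01 * i, 40.70 + 0.02 * i] for i in range(9)])
labels = split_into_three(first_component_scores(coords))
region_names = np.array(["Lower", "Midtown", "Upper"])[labels]
```

Terciles force equally sized groups, which may not match where the neighborhoods actually begin and end; fixed cut points chosen from a map would be an alternative.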

4. Model Selection 🌲

In this part I ran several models, including constant prediction, linear regression, and tree regression, and compared the root mean squared error (RMSE) of each model's predictions. Tree regression outperformed all of the other models. Predicting speed directly also resulted in a lower RMSE. This was useful because ride duration can be derived by combining speed and distance (which was already provided as a feature).
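The arithmetic behind this comparison is short enough to sketch: RMSE is the square root of the mean squared difference between actual and predicted values, and a predicted speed converts to a duration as distance divided by speed. The units (miles, mph, seconds), the helper names, and all numbers below are assumptions for illustration, not results from the notebook.

```python
from math import sqrt
from statistics import mean
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def duration_from_speed(distance_miles: float, speed_mph: float) -> float:
    """Convert a predicted speed into a ride duration in seconds."""
    return distance_miles / speed_mph * 3600.0


# Made-up values for illustration only.
distances = [1.0, 2.5, 0.8]               # miles
predicted_speeds = [10.0, 12.5, 8.0]      # mph, as if output by a speed model
actual_durations = [360.0, 700.0, 380.0]  # seconds

predicted_durations = [
    duration_from_speed(d, s) for d, s in zip(distances, predicted_speeds)
]
model_error = rmse(actual_durations, predicted_durations)

# Constant-prediction baseline: always predict the mean duration.
baseline = [mean(actual_durations)] * len(actual_durations)
baseline_error = rmse(actual_durations, baseline)
```

Computing the error on the derived durations, rather than on the speeds themselves, keeps the speed-based model comparable with models that predict duration outright.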
