Analysis of NYC taxi data and a predictive model on tip percentage of total fare
This project aims to answer the following questions. It has three parts: Analysis, Predictive Model, and Option A.
This coding challenge is designed to test your skill and intuition about real-world data. For the challenge, we will use data collected by the New York City Taxi and Limousine Commission about “Green” Taxis. Green Taxis (as opposed to yellow ones) are taxis that are not allowed to pick up passengers inside the densely populated areas of Manhattan. We will use the trip record data from September 2015, published by the NYC Taxi and Limousine Commission: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Required Questions: Please answer all five required questions completely.
• Programmatically download and load into your favorite analytical tool the trip data for September 2015.
• Report how many rows and columns of data you have loaded.
• Plot a histogram of the trip distance (“Trip Distance”).
• Report any structure you find and any hypotheses you have about that structure.
• Report mean and median trip distance grouped by hour of day.
• We’d like a rough way to identify trips that originate or terminate at one of the NYC area airports. Can you provide a count of how many transactions fit this criterion, the average fare, and any other interesting characteristics of these trips?
• Build a derived variable for tip as a percentage of the total fare.
• Build a predictive model for tip as a percentage of the total fare. Use as much of the data as you like (or all of it). Provide an estimate of performance using an appropriate sample, and show your work.
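Two of the required steps above, the tip-percentage variable and the mean and median trip distance by hour of day, can be sketched without any third-party libraries. This is a minimal illustration, not a reference solution. The column names `lpep_pickup_datetime` and `Trip_distance`, the timestamp format, and the choice to treat the total amount as the "total fare" are all assumptions about the September 2015 file and should be checked against the downloaded data.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median


def tip_percentage(tip_amount, total_amount):
    """Tip as a percentage of the total amount.

    Returns None when the total is not positive, so that zero or
    negative totals do not produce a division error or a misleading value.
    """
    if total_amount <= 0:
        return None
    return 100.0 * tip_amount / total_amount


def distance_by_hour(rows,
                     pickup_key="lpep_pickup_datetime",  # assumed column name
                     distance_key="Trip_distance"):      # assumed column name
    """Return {hour: (mean_distance, median_distance)} for pickup hours 0-23.

    `rows` is any iterable of dict-like records, for example the rows
    produced by csv.DictReader. The timestamp format is an assumption.
    """
    by_hour = defaultdict(list)
    for row in rows:
        pickup = datetime.strptime(row[pickup_key], "%Y-%m-%d %H:%M:%S")
        by_hour[pickup.hour].append(float(row[distance_key]))
    return {hour: (mean(dists), median(dists))
            for hour, dists in sorted(by_hour.items())}
```

Returning `None` for non-positive totals keeps those rows visible, so they can be counted and then excluded explicitly before any model is fitted.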
Choose only one of these options to answer for Question 5. There is no preference as to which one you choose. Please select the question that you feel your particular skills and/or expertise are best suited to. If you answer more than one, only the first will be scored.
• Build a derived variable representing the average speed over the course of a trip.
• Can you perform a test to determine if the average trip speeds are materially the same in all weeks of September? If you decide they are not the same, can you form a hypothesis regarding why they differ?
• Can you build up a hypothesis of average trip speed as a function of time of day?
• Can you build a visualization (interactive or static) of the trip data that helps us understand intra- vs. inter-borough traffic? What story does it tell about how New Yorkers use their green taxis?
• We’re thinking about promoting ride sharing. Build a function that, given a point P, finds the k trip origination points nearest P.
– For this question, point P would be a taxi ride starting location picked by us at a given LAT-LONG.
– As an extra layer of complexity, consider the time for pickups, so this could eventually be used for real time ride sharing matching.
– Please explain not only how this can be computed, but also how efficient your approach is (time and space complexity).
• What anomalies can you find in the data? Did taxi traffic or behavior deviate from the norm on a particular day/time or in a particular location?
• Using time-series analysis, clustering, or some other method, please develop a process/methodology to identify out-of-the-norm behavior and attempt to explain why those anomalies occurred.
• If the data leaps out and screams some question of you that we haven’t asked, ask it and answer it! Use this as an opportunity to highlight your special skills and philosophies.
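For the ride-sharing option above (the k trip origination points nearest P), one possible baseline is a linear scan that keeps only the k best candidates. The sketch below assumes pickups are given as (latitude, longitude) pairs in degrees and uses great-circle (haversine) distance. The function names are hypothetical, and the pickup-time dimension is left out.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def k_nearest_pickups(pickups, p_lat, p_lon, k):
    """Return up to k (distance_km, (lat, lon)) pairs, nearest first.

    `pickups` is an iterable of (lat, lon) tuples. The scan visits each
    point once and keeps a heap of at most k candidates.
    """
    scored = ((haversine_km(p_lat, p_lon, lat, lon), (lat, lon))
              for lat, lon in pickups)
    return heapq.nsmallest(k, scored)
```

With n pickup points, this scan takes O(n log k) time per query and O(k) extra space beyond the input. If many queries are expected against the same data, a spatial index such as a k-d tree or a grid of cells would likely be worth the build cost. Filtering candidates by a pickup-time window before scoring is one way to add the time dimension.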