-
Notifications
You must be signed in to change notification settings - Fork 0
/
report.Rmd
412 lines (309 loc) · 18.4 KB
/
report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
---
title: "Forecasting Lyft Fares Using Linear Regression"
author: "Jessica Padilla"
date: "December 13, 2019"
output: pdf_document
---
## Introduction
Lyft is one of the many rideshare services available to people today. Founded in 2012, Lyft allows users to order rides through the use of a phone app. There are 6 types of rides available: Shared Ride, Lyft, Lyft XL, Lux, Lux Black, and Lux Black XL (Lyft). This project will focus on the basic Lyft service, which is the most commonly used ride.
Fares are generally determined by the distance for each ride. However, Lyft and other rideshare companies have adopted the practice of what is called price surging. This involves increasing the price of a ride because the demand for Lyft cars exceeds the number of Lyft cars that are actually in service (Lyft). This can result from many conditions such as densely populated locations, bad weather, and busy commute hours within the day. In this project, we will utilize a data set of Lyft rides within Boston, Massachusetts to see what factors contribute to price surges. We will, then, use the determined factors to forecast the price of any Lyft ride.
## Methods
Lyft data was retrieved from Kaggle (https://www.kaggle.com/sliu65/data-mining-project-boston) and uploaded onto GitHub (https://github.com/jessicapadilla/lyft_fares/blob/master/lyft.csv.zip?raw=true). All of the data was imported into R. The tidyverse package was used for data cleaning, analysis, and visualization. The ggcorrplot and knitr packages was also used to enhance visualization.
The R code for this project can be found within the Supplementary Materials section of this paper and on GitHub (https://github.com/jessicapadilla/lyft_fares/blob/master/code.R).
## Results
Lyft fare price surging is affected by what is termed the surge multiplier. If the surge multiplier is equal to 1, then it means that there are enough Lyft cars in the area to meet user demand and, therefore, the fare remains unchanged. If the surge multiplier is greater than 1, then there are more users than Lyft cars in service. As a result, the standard fare is increased by the surge multiplier. For this particular dataset, which focuses on the city of Boston, the most frequent surge multiplier is 1.25.
```{r, data, echo = FALSE, message = FALSE, warning = FALSE, results = 'hide', fig.height = 2.8}
## load libraries
library(tidyverse)
library(ggcorrplot)
library(knitr)
## assign the link for the lyft zip folder
lyft_url <- "https://github.com/jessicapadilla/lyft_fares/blob/master/lyft.csv.zip?raw=true"
## create a temporary file so the data can be downloaded into it
lyft_temp <- tempfile()
## download the zip folder containing the data
download.file(lyft_url, lyft_temp, mode = "wb")
## unzip the zip folder
unzip(lyft_temp, "lyft.csv")
## read the csv file within the zip folder
lyft <- read.csv("lyft.csv", sep = ",", header = TRUE)
## check the structure of the data
str(lyft)
## check if the visibility and visibility.1 columns are the same
identical(lyft$visibility, lyft$visibility.1)
## convert the data into a tibble
## remove unnecessary columns
## rename the remaining columns
lyft <- as_tibble(lyft) %>% select(-c(id, visibility.1)) %>%
rename(company = cab_type,
car_type = name,
apparent_temperature = apparentTemperature,
weather_summary = short_summary,
precip_intensity = precipIntensity,
precip_probability = precipProbability,
wind_speed = windSpeed,
temp_high = temperatureHigh,
temp_low = temperatureLow,
dew_point = dewPoint,
wind_bearing = windBearing,
cloud_cover = cloudCover,
uv_index = uvIndex,
moonphase = moonPhase)
## convert the integer columns to numeric
lyft[, c(1:3, 22, 24)] <- sapply(lyft[, c(1:3, 22, 24)], as.numeric)
## filter the data to only include the basic lyft services
lyft <- lyft %>% filter(car_type == "Lyft")
## condense the categories in the weather summary column
lyft <- lyft %>% mutate(weather_summary = case_when(
weather_summary == " Clear " ~ "Clear",
weather_summary %in% c(" Overcast ", " Mostly Cloudy ", " Partly Cloudy ", " Possible Drizzle ") ~ "Cloudy",
weather_summary %in% c(" Light Rain ", " Rain ", " Drizzle ") ~ "Rainy",
weather_summary == " Foggy " ~ "Foggy"))
## reorder the types of weather conditions
lyft$weather_summary <- factor(lyft$weather_summary,
levels = c("Cloudy", "Clear", "Rainy", "Foggy"))
## create a graph to show how frequent price surging occurs
lyft %>% filter(surge_multiplier > 1) %>%
ggplot(aes(surge_multiplier)) +
geom_histogram(binwidth = 0.25, col = "black", fill = "purple") +
scale_x_continuous(breaks = seq(1.25, 3.00, 0.25)) +
coord_cartesian(ylim = c(0, 2250)) +
xlab("Surge Multiplier") + ylab("Count") +
ggtitle("Frequency of Price Surging") +
theme_bw()
```
When surge multipliers are plotted against each Lyft ride's starting location, we see that the amount of surge multipliers is highest in the Boston district of Back Bay, which can be as high as roughly 3.00. Haymarket Square, on the other hand, has the fewest price surges and never goes above 1.75.
```{r, starting location, echo = FALSE, message = FALSE, warning = FALSE, results = 'hide', fig.height = 3.5}
## create a graph to show how surge multiplier varies by source location
## exclude surge_multiplier of 1 since it doesn't cause fare to increase
lyft %>% filter(surge_multiplier > 1) %>%
mutate(source = reorder(source, -surge_multiplier)) %>%
ggplot(aes(source, surge_multiplier, col = source)) +
geom_jitter(alpha = 0.2, size = 2, width = 0.2) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
scale_y_continuous(breaks = seq(1.25, 3.00, 0.25)) +
xlab("Starting Location") + ylab("Surge Multiplier") +
ggtitle("Surge Multiplier Based On Starting Location") +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90, hjust = 1))
```
Thw weather can also affect surge multipliers. The number of rides that are affected by surge pricing are highest when weather conditions are cloudy. Furthermore, when the weather is either cloudy or rainy, surge multipliers can reach up to 3.00. It is interesting to note that clear weather conditions have nearly the same pattern of surge pricing as rainy conditions, indicating that there are other variables that impact surge multipliers.
```{r, weather, echo = FALSE, message = FALSE, warning = FALSE, results = 'hide', fig.height = 3.5}
## create a graph to show how the surge multiplier changes with weather
lyft %>% filter(surge_multiplier > 1) %>%
group_by(surge_multiplier) %>% count(weather_summary) %>%
ggplot(aes(surge_multiplier, n, col = purple)) +
geom_bar(stat = "identity", fill = "purple", color = "black") +
facet_wrap(. ~ weather_summary) +
scale_x_continuous(breaks = seq(1.25, 3.00, 0.25)) +
xlab("Surge Multiplier") + ylab("Count") +
ggtitle("Surge Pricing During Different Weather Conditions") +
theme_bw()
```
We, then, created a correlogram for the surge multiplier and other numeric variables of interest in the dataset. This allowed us to visualize any possible correlations between the variables through the use of color and a numeric scale.
If variables are positively correlated, then the intersecting box between variables in the graph would be a shade of red. If the variables are negatively correlated, then the intersecting box in the graph would be shade of purple. The darkness of the color represents the strength of the relationship: the darker the color, the stronger the relationship.
If there is a minus sign in front of the correlation coefficient, then there is a negative linear relationship (when one variable increases, the other decreases). If there is no sign, then there is a positive linear relationship (when one variable increases, the other increases).
The strength of the correlation coefficient is numerically defined as follows ("How To Interpret A Correlation Coefficient R"):
* 1.0: A perfect correlation.
* Between 0.7 to 1.0: A strong correlation.
* Between 0.5 to 0.7: A moderate correlation.
* Between 0.3 to 0.5: A weak correlation.
* 0: No correlation.
```{r, correlogram, echo = FALSE, message = FALSE, warning = FALSE, results = 'hide', fig.height = 4.5}
## revise the data frame to only include numeric values of interest
lyft_num <- lyft %>% select(c(surge_multiplier, hour, day,
price, distance,
temperature, precip_intensity,
humidity, wind_speed,
visibility, cloud_cover))
## round the data to one decimal place
corr <- round(cor(lyft_num), 1)
## create a correlogram
ggcorrplot(corr, type = "lower",
lab = TRUE, lab_size = 3.5,
title = "Correlogram Of Variables",
ggtheme = theme_bw())
```
The correlogram indicates that there is a moderate relationship between the surge multiplier and price, which have a correlation coefficient of 0.5. There is a strong correlation between price and distance, which have a correlation coefficient of 0.8. It is also clear that Lyft fares are not affected by the hour in the day, the day within the week, or other specific weather variables.
We built a linear regression model (see Supplementary Materials section for code) by splitting the data set into a training set (90% of the data) and a test set (10% of the data). The price of the Lyft ride was forecasted by using the surge multiplier, distance, source location, and weather summary. This model generated a multiple R-squared value of 0.845, which means it explains 84.5% of the variation in Lyft prices. In addition, the correlation accuracy of this model is 0.92, which means that the predicted prices of Lyft rides are very similar to the actual prices since its value is close to 1. This can be seen by examining the first few lines of data generated by this model in the table below.
```{r, model, echo = FALSE, message = FALSE, warning = FALSE, results = 'hide'}
## set the seed
set.seed(123, sample.kind = "Rounding")
## there are 51235 observations
## split the data into training and testing sets
## set 90% of the data for the training set (46111 observations)
## set the remaining 10% of the data for the test set (5124 observations)
## randomize the data by first creating a vector of random integers
train_sample <- sample(51235, 46111)
## use the vector of random integers to randomly select observations from the data
## create the training set
lyft_train <- lyft[train_sample, ]
## create the test set
lyft_test <- lyft[-train_sample, ]
## create a linear regression model using the training set
lyft_model <- lm(price ~ surge_multiplier + distance +
source + weather_summary, data = lyft_train)
## check the summary to evaluate the model's performance
summary(lyft_model)
## use the linear regression model on the test set
lyft_pred <- predict(lyft_model, lyft_test)
## create a data frame to compare the predicted and the actual values
actual_predicted <- data.frame(cbind(Actual = lyft_test$price, Predicted = lyft_pred))
## check the correlation accuracy between the predicted and the actual values
correlation_accuracy <- cor(actual_predicted)
correlation_accuracy
```
```{r, table, echo = FALSE, message = FALSE, warning = FALSE}
## check the first few rows of the predicted and the actual values
kable(head(actual_predicted),
caption = "Comparing the Actual Prices to the Predicted Prices", digits = 1)
```
## Conclusion
Price surging is a very common practice of rideshare companies. It occurs when the demand for rides exceeds the number of cars available for service. By analyzing a dataset of Lyft rides for the city of Boston, several variables were found to impact price surging, such as weather conditions and a ride's starting location. Using these findings plus the correlation of price and distance, we were able to build a linear regression model that forecasts Lyft prices with an accuracy of 0.92. Although this accuracy is very acceptable since it is close to a value of 1, this model could still potentially be improved by collecting more data throughout several locations and by including important events in each location (sporting events, concerts, etc.) which may affect the need for Lyft rides.
## References
"Lyft." *Lyft*, Lyft, Inc., 2019, lyft.com.
Rumsey, Deborah. "How to Interpret a Correlation Coefficient r." *Dummies: A Wiley Brand*, John Wiley & Sons, Inc., 2019, dummies.com/education/math/statistics/how-to-interpret-a-correlation-coefficient-r.
## Supplementary Material
This data science project was completed with the use of the R code below.
```{r, eval = FALSE}
## load libraries
library(tidyverse)
library(ggcorrplot)
library(knitr)
## assign the link for the lyft zip folder
lyft_url <- "https://github.com/jessicapadilla/lyft_fares/blob/master/
lyft.csv.zip?raw=true"
## create a temporary file so the data can be downloaded into it
lyft_temp <- tempfile()
## download the zip folder containing the data
download.file(lyft_url, lyft_temp, mode = "wb")
## unzip the zip folder
unzip(lyft_temp, "lyft.csv")
## read the csv file within the zip folder
lyft <- read.csv("lyft.csv", sep = ",", header = TRUE)
## check the structure of the data
str(lyft)
## check if the visibility and visibility.1 columns are the same
identical(lyft$visibility, lyft$visibility.1)
## convert the data into a tibble
## remove unnecessary columns
## rename the remaining columns
lyft <- as_tibble(lyft) %>% select(-c(id, visibility.1)) %>%
rename(company = cab_type,
car_type = name,
apparent_temperature = apparentTemperature,
weather_summary = short_summary,
precip_intensity = precipIntensity,
precip_probability = precipProbability,
wind_speed = windSpeed,
temp_high = temperatureHigh,
temp_low = temperatureLow,
dew_point = dewPoint,
wind_bearing = windBearing,
cloud_cover = cloudCover,
uv_index = uvIndex,
moonphase = moonPhase)
## convert the integer columns to numeric
lyft[, c(1:3, 22, 24)] <- sapply(lyft[, c(1:3, 22, 24)], as.numeric)
## filter the data to only include the basic lyft services
lyft <- lyft %>% filter(car_type == "Lyft")
## condense the categories in the weather summary column
lyft <- lyft %>% mutate(weather_summary = case_when(
weather_summary == " Clear " ~ "Clear",
weather_summary %in% c(" Overcast ", " Mostly Cloudy ", " Partly Cloudy ",
" Possible Drizzle ") ~ "Cloudy",
weather_summary %in% c(" Light Rain ", " Rain ", " Drizzle ") ~ "Rainy",
weather_summary == " Foggy " ~ "Foggy"))
## reorder the types of weather conditions
lyft$weather_summary <- factor(lyft$weather_summary,
levels = c("Cloudy", "Clear", "Rainy", "Foggy"))
## create a graph to show how frequent price surging occurs
lyft %>% filter(surge_multiplier > 1) %>%
ggplot(aes(surge_multiplier)) +
geom_histogram(binwidth = 0.25, col = "black", fill = "purple") +
scale_x_continuous(breaks = seq(1.25, 3.00, 0.25)) +
coord_cartesian(ylim = c(0, 2250)) +
xlab("Surge Multiplier") + ylab("Count") +
ggtitle("Frequency of Price Surging") +
theme_bw()
## create a graph to show how surge multiplier varies by source location
## exclude surge_multiplier of 1 since it doesn't cause fare to increase
lyft %>% filter(surge_multiplier > 1) %>%
mutate(source = reorder(source, -surge_multiplier)) %>%
ggplot(aes(source, surge_multiplier, col = source)) +
geom_jitter(alpha = 0.2, size = 2, width = 0.2) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
scale_y_continuous(breaks = seq(1.25, 3.00, 0.25)) +
xlab("Starting Location") + ylab("Surge Multiplier") +
ggtitle("Surge Multiplier Based On Starting Location") +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90, hjust = 1))
## create a graph to show how the surge multiplier changes with weather
lyft %>% filter(surge_multiplier > 1) %>%
group_by(surge_multiplier) %>% count(weather_summary) %>%
ggplot(aes(surge_multiplier, n, col = purple)) +
geom_bar(stat = "identity", fill = "purple", color = "black") +
facet_wrap(. ~ weather_summary) +
scale_x_continuous(breaks = seq(1.25, 3.00, 0.25)) +
xlab("Surge Multiplier") + ylab("Count") +
ggtitle("Surge Pricing During Different Weather Conditions") +
theme_bw()
## revise the data frame to only include numeric values of interest
lyft_num <- lyft %>% select(c(surge_multiplier, hour, day,
price, distance,
temperature, precip_intensity,
humidity, wind_speed,
visibility, cloud_cover))
## round the data to one decimal place
corr <- round(cor(lyft_num), 1)
## create a correlogram
ggcorrplot(corr, type = "lower",
lab = TRUE, lab_size = 3.5,
title = "Correlogram Of Variables",
ggtheme = theme_bw())
## set the seed
set.seed(123, sample.kind = "Rounding")
## there are 51235 observations
## split the data into training and testing sets
## set 90% of the data for the training set (46111 observations)
## set the remaining 10% of the data for the test set (5124 observations)
## randomize the data by first creating a vector of random integers
train_sample <- sample(51235, 46111)
## use the vector of random integers to randomly select observations from the data
## create the training set
lyft_train <- lyft[train_sample, ]
## create the test set
lyft_test <- lyft[-train_sample, ]
## create a linear regression model using the training set
lyft_model <- lm(price ~ surge_multiplier + distance +
source + weather_summary, data = lyft_train)
## check the summary to evaluate the model's performance
summary(lyft_model)
## use the linear regression model on the test set
lyft_pred <- predict(lyft_model, lyft_test)
## create a data frame to compare the predicted and the actual values
actual_predicted <- data.frame(cbind(Actual = lyft_test$price, Predicted = lyft_pred))
## check the correlation accuracy between the predicted and the actual values
correlation_accuracy <- cor(actual_predicted)
correlation_accuracy
## check the first few rows of the predicted and the actual values
kable(head(actual_predicted),
caption = "Comparing the Actual Prices to the Predicted Prices", digits = 1)