-
Notifications
You must be signed in to change notification settings - Fork 0
/
R01_markdown_attempt1.Rmd
179 lines (139 loc) · 7.8 KB
/
R01_markdown_attempt1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
---
title: "Caltrain Delays"
output: github_document
html_preview: true
---
```{r setup, echo=FALSE}
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
knitr::opts_chunk$set(message = FALSE, warning = FALSE, echo=F)
```
```{r aggregate, echo=FALSE}
library(tidyverse)
library(lubridate)
library(gridExtra)
tweets <- read_csv("Data/caltrain_tweets_new.csv")
is.even <- function(x) x %% 2 == 0
is.odd <- function(x) x %% 2 != 0
working <- tweets %>%
select(-time_of_day) %>%
complete(date, commute) %>%
filter(commute != "Other") %>%
mutate(weekday = wday(date, label=T)) %>%
filter(weekday != "Sat") %>%
filter(weekday != "Sun") %>%
mutate(month = month(date)) %>%
mutate(delay = ifelse(grepl("delay", text, ignore.case=T) | grepl("-[0-9]" , text),1,0)) %>% # to account for the -XX" thing they do
mutate(major_delay = ifelse(delay == 1 & grepl("-[([2-9]|[0-9])]", text),1,0)) %>% # >20 minutes
mutate(train = as.numeric(str_extract(text, "[0-9]{3}"))) %>%
mutate(direction = ifelse(is.even(train), "SB",
ifelse(is.odd(train), "NB", NA))) %>%
mutate(nb = ifelse(direction == "NB", 1,0)) %>%
mutate(sb = ifelse(direction == "SB", 1,0))
aggdata <- working %>%
group_by(date, commute) %>%
summarise(delay = max(delay), major_delay = max(major_delay), nb = max(nb), sb=max(sb)) %>%
mutate(weekday = wday(date, label=T))
aggdata2 <- working %>%
group_by(date, commute) %>%
summarise(delay = max(delay), major_delay = max(major_delay)) %>%
mutate(weekday = wday(date, label=T))
```
```{r make figures, echo=FALSE}
library(ggthemr)
ggthemr("fresh")
# p1 - all
p1 <- aggdata %>%
ungroup() %>%
summarise(delay = mean(delay), major_delay = mean(major_delay)) %>%
gather("share") %>%
ggplot(., aes(x = share, y=value, fill=as.factor(share))) + geom_bar(stat="identity", width = .5) +
geom_text(aes(label=formatC(value*100, digits = 3)), vjust=1.6, color="white", size=3.5)+
ylab("Proportion of Commutes") +
xlab("") +
ylim(0,1) +
theme(legend.position="none") +
scale_x_discrete(labels = c('Delay','Major Delay')) +
ggtitle("Proportion of Commutes with Delays")
# p2 - by commute
p2 <- aggdata %>%
group_by(commute) %>%
summarise( delay = mean(delay), major_delay = mean(major_delay)) %>%
gather(key ="share", value="value", -commute)%>%
ggplot(., aes(x = commute, y=value, fill=as.factor(share))) + geom_col( width = .6, position = "dodge2") +
geom_text(aes(label=formatC(value*100, digits = 3)), vjust=1.6, color="white", size=3.5, position = position_dodge2(width = .6)) +
ylab("Proportion of Commutes") +
xlab("") +
ylim(0,1) +
scale_fill_discrete(name="",
labels=c("Delay", "Major Delay"))+
ggtitle("Proportion of Morning and Evening
Commutes with Delays")
# p3 - day of the week
p3 <- aggdata %>%
group_by(weekday) %>%
summarise( delay = mean(delay), major_delay = mean(major_delay)) %>%
gather(key ="share", value="value", -weekday) %>%
ggplot(., aes(x = weekday, y=value, fill=as.factor(share))) + geom_col( width = .9, position = "dodge2") +
geom_text(aes(label=formatC(value*100, digits = 3)), vjust=1.6, color="white", size=2, position = position_dodge2(width = .9)) +
ylab("Proportion of Commutes") +
xlab("") +
ylim(0,1) +
scale_fill_discrete(name="",
labels=c("Delay", "Major Delay"))+
ggtitle("Proportion of Commutes with Delays
by Day of the Week")
# p4 - over time
p4 <- aggdata %>%
mutate(week = floor_date(date, unit="week"))%>%
group_by(week) %>%
summarise(delay = mean(delay), major_delay = mean(major_delay)) %>%
gather("share", "value", -week) %>%
ggplot(., aes(x=week, y = value, color= share)) + geom_line() +
ylab("Proportion of Commutes, Weekly Average") +
xlab("")+
ylim(0,1) +
scale_x_date(date_labels="%b %y",date_breaks ="1 month") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_color_discrete(name="",
labels=c("Delay", "Major Delay"))+
ggtitle("Proportion of Commutes with Delays
Over Time")
# most mentioned trains
temp <- working %>%
group_by(date, commute) %>%
summarise(delay = max(delay), major_delay = max(major_delay), train = toString(unique(train))) %>%
separate(train, sep=",", into = c("t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"), convert = T) %>%
mutate_at(vars(starts_with("t")), funs(as.numeric))
p5 <- temp %>%
gather("key", "value", t1, t2, t3, t4, t5, t6, t7, t8, t9) %>%
group_by(value) %>%
summarise(n=n()) %>%
filter(!is.na(value)) %>%
filter(n >4)%>%
ggplot(., aes(x = reorder(as.factor(value),-n, sum), y=n)) + geom_col( width = .6, position = "dodge2") +
theme(axis.text.x=element_text(angle=45, hjust=1))+
ylab("Num. Commute Delays") +
xlab("Train Number") +
ggtitle("Trains with the Most Delays")
# direction
```
## Overview
The goal of this project was to determine how often Caltrain is delayed for commuters.
**Short answer: Caltrain has significant delays on 1-2 commutes per week.**
## Background
[Caltrain](http://caltrain.com) is a commuter rail line that serves the San Francisco Peninsula. It serves about 65,000 riders per week day over it's 77-mile route, and has an operating budget of about $150 million.
Caltrain [defines on-time arrivals](http://www.caltrain.com/Assets/__Agendas+and+Minutes/JPB/2018/2018-12-06+Rail+Safety+Presentation.pdf) by comparing scheduled arrival times to actual arrival times at end-line locations (San Francisco 4th and King, San Jose Diridon, Tamien, or Gilroy stations). It has a stated goal of 95 percent on-time at end-line locations, defined at reaching the final destination no more than 6 minutes from the scheduled arrival time. By this metric, Caltrain's on-time performance (OTP) is about 93 percent.
Caltrain is primarily used by weekday commuters. Weekend ridership is about 85 percent less than weekday ridership, and the fullest trains in both directions operate during weekday commute hours (5-9AM and 3-7PM). Because of this, Caltrain's OTP metrics give an incomplete picture of reliability for most riders.
Caltrain operates a [twitter account](https://twitter.com/Caltrain) that provides "news, major service impacts & answers between 7am and 7pm" on weekdays. Data were gathered from this account and aggregated as shown in the methodology below.
## Results
```{r show figures, out.width='\\textwidth', fig.align='center'}
# grid.arrange(p1, p2,p3, p4, nrow = 4)
p1
p2
p3
p4
```
<em>Notes:</em> Data on delays from Caltrain twitter account. A delay is defined by the announcement of delays in a tweet for any train and any station, and a major delay is defined as at least 20 minutes behind schedule. Morning commute is defined as 7-9AM, and evening commute as 3-7PM.
## Methodology
Data were pulled from the Caltrain twitter account using the [rtweet](https://rtweet.info/index.html) package. This package uses the Twitter API to gather up to 3200 tweets on a user's timeline. Data were collected starting on `r min(aggdata$date)`. Replies were removed, and all tweets not during commute hours (7-9AM and 3-7PM) were dropped. All weekend tweets were dropped as well.
Tweets were aggregated by date and commute (morning/evening), such that each date had two commutes even if the account did not tweet during those times on a given day. A delay tweet was defined by mentioning the word "delay", "delayed", or a mention of -XX, where XX is a pair of integers. The latter definition is to capture the accounts shorthand for delay times (e.g. NB278 -15" departing MVW indicates that north bound train 257 is 15 minutes late departing the Mountain View station). A major delay is indicated by a delay of 20 minutes or more. Data were collapsed to record whether any tweets during commute times indicated a delay, and if so, whether there was a major delay.