-
Notifications
You must be signed in to change notification settings - Fork 0
/
Non-Seasonal Time Series Analysis Project.Rmd
350 lines (306 loc) · 13.3 KB
/
Non-Seasonal Time Series Analysis Project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
---
title: "Non-Seasonal Time Series Analysis"
author: 'Aston Glen Noronha (CWID: 20012232)'
date: "`r Sys.Date()`"
output:
html_document:
df_print: paged
---
<span style="font-size: 18px;font-weight: bold;">Introduction and Motivation </span>
<span style="font-size: 18px;">Stock market data can be interesting to analyze and as a further incentive, strong predictive models can have large financial payoff. The amount of financial data on the web is seemingly endless. A large and well-structured dataset on a wide array of companies can be hard to come by. Here I am using a dataset containing Twitter stock prices from 2013 to 2019.</span>
<span style="font-size: 18px;">The goal of this project is to predict and forecast daily return values for a particular stock. Given the highly
volatile nature of stock data, we will fit a univariate GARCH model to achieve our goal of predicting the daily returns value to allow you to make statistically informed trades!</span>
<span style="font-size: 18px;font-weight: bold;">Data Description </span>
<span style="font-size: 18px;">Data Range : From November 2013 to December 2019 </span>
<span style="font-size: 18px;">Dataset Description : The dataset contains Twitter Stock Prices </span>
<span style="font-size: 18px;">Datasource Description : The data is from Kaggle website and can be accessed using the link: https://www.kaggle.com/datasets/maharshipandya/twitter-stocks-dataset</span>
<span style="font-size: 18px;">Date format : yyyy-mm-dd </span>
<span style="font-size: 18px;">Open : price of the stock at market open </span>
<span style="font-size: 18px;">High : Highest price reached in the day </span>
<span style="font-size: 18px;">Low : Lowest price reached in the day</span>
<span style="font-size: 18px;">Close : price of the stock at market close</span>
<span style="font-size: 18px;">Volume : Number of shares traded</span>
```{r}
library(stats)
library(tseries)
library(tidyverse)
library(TSA)
library(ggplot2)
library(rugarch)
library(forecast)
library(dplyr)
```
<span style="font-size: 18px;font-weight: bold;">Let's take a look at the dataframe we are working with</span>
```{r}
df <- read.csv("C:/Users/Aston/Downloads/Academic Projects/Time Series Project/twitter-stocks.csv")
df <- select(df, Date, Close)
df$Date <- as.Date(df$Date, "%Y-%m-%d")
num_rows <- nrow(df)
new_dates <- seq(from = as.Date("2013-11-01"), by = "days", length.out = num_rows)
df$Date <- new_dates
start_date <- as.Date("2013-11-01")
end_date <- as.Date("2019-12-31")
filtered_data <- subset(df, Date >= start_date & Date <= end_date)
head(filtered_data)
tail(filtered_data)
```
<span style="font-size: 18px;font-weight: bold;">Initial look at the data:</span>
```{r}
ggplot(filtered_data, aes(x = Date, y = Close)) +
geom_line() +
labs(title = "Time Series of Close Prices(Original)",
x = "Date",
y = "Close Price")
train_data <- subset(df, Date >= start_date & Date <= as.Date("2019-10-31"))
test_data <- subset(df, Date > as.Date("2019-10-31") & Date <= end_date)
data1 <- train_data$Close
```
```{r}
acf(data1, lag.max = 150, main='ACF of Original data')
```
```{r}
pacf(data1, lag.max = 150, main='PACF of Original data')
```
<span style="font-size: 18px;font-weight: bold;">EACF of the Original data</span>
```{r}
eacf(data1)
```
<span style="font-size: 18px;">From the ACF we see that it is not cutting off and constantly dying. From the PACF, it seems like AR(1) model. From the EACF, we can try different ARMA orders to try and capture all the dependencies.</span>
<span style="font-size: 18px;font-weight: bold;">Checking for stationarity</span>
```{r}
adf_test = adf.test(data1)
print(adf_test)
```
<span style="font-size: 18px;">Since p-value=0.3565 > 0.05, the TS is non-stationary</span>
<span style="font-size: 18px;font-weight: bold;">Differencing to make the data stationary</span>
```{r}
close_diff <- diff(data1)
plot(close_diff, type = 'l', xlab = 'Date', ylab = 'Value', main = 'Line graph for log-differencing')
```
```{r}
adf_test = adf.test(close_diff)
print(adf_test)
```
<span style="font-size: 18px;">Since p-value=0.01 < 0.05, the TS is stationary</span>
```{r}
acf(close_diff, main='ACF of log-differenced data')
pacf(close_diff, main='PACF of log-differenced data')
```
<span style="font-size: 18px;font-weight: bold;">EACF of the differenced data</span>
```{r}
eacf(close_diff)
```
<span style="font-size: 18px;font-weight: bold;">Functions for ARIMA models</span>
```{r}
plot_arima_resid <- function(model) {
resid <- resid(model)
resid_ts <- ts(resid, start=c(2013, 311), end=c(2020,300), frequency=365)
p <- model$arma[1]
q <- model$arma[2]
plot(resid_ts, xlab = "Year", ylab = 'Residuals', main=paste('Residual Plot',"(", p, ", 1, ", q, ")"))
acf(resid_ts[1:length(resid_ts)],lag.max=150, main=paste('ACF of Residual',"(", p, ", 1, ", q, ")"))
pacf(resid_ts[1:length(resid_ts)],lag.max=150, main=paste('PACF of Residual',"(", p, ", 1, ", q, ")"))
}
my_df <- data.frame(model = character(),
AIC = numeric(),
BIC = numeric(),
Shapiro = round(numeric(),3),
Ljung = round(numeric(),3))
new_row <- data.frame(model='',AIC=0,BIC=0,Shapiro=0,Ljung=0)
plot_arima_residuals <- function(arima_model, my_df) {
resid <- residuals(arima_model)
resid_ts <- ts(resid, start=c(2013, 311), end=c(2020,300), frequency=365)
p <- arima_model$arma[1]
q <- arima_model$arma[2]
order <- paste("(", p, ", 1, ", q, ")", sep = "")
plot_arima_resid(arima_model)
qqnorm(resid_ts, main=paste('Residuals plot',order))
qqline(resid_ts)
shap=shapiro.test(resid_ts)
ljung=Box.test(resid, lag = 20, type = "Ljung-Box")
cat("AIC:", AIC(arima_model),"\n")
cat("BIC:", BIC(arima_model),"\n")
model_name <- paste("ARIMA", paste(order, collapse = ','), sep = " ")
if (!model_name %in% my_df$model) {
new_row <- data.frame(model = model_name,
AIC = AIC(arima_model),
BIC = BIC(arima_model),
Shapiro = round(shap$p.value, 3),
Ljung = round(ljung$p.value, 3))
my_df <- rbind(my_df, new_row)
}
}
```
```{r}
arima011 <- arima(x=close_diff, order=c(0,1,1))
arima110 <- arima(x=close_diff, order=c(1,1,0))
arima111 <- arima(x=close_diff, order=c(1,1,1))
arima210 <- arima(x=close_diff, order=c(2,1,0))
arima211 <- arima(x=close_diff, order=c(2,1,1))
arima012 <- arima(x=close_diff, order=c(0,1,2))
arima112 <- arima(x=close_diff, order=c(1,1,2))
arima212 <- arima(x=close_diff, order=c(2,1,2))
arima310 <- arima(x=close_diff, order=c(3,1,0))
arima311 <- arima(x=close_diff, order=c(3,1,1))
arima312 <- arima(x=close_diff, order=c(3,1,2))
arima013 <- arima(x=close_diff, order=c(0,1,3))
arima113 <- arima(x=close_diff, order=c(1,1,3))
arima213 <- arima(x=close_diff, order=c(2,1,3))
arima313 <- arima(x=close_diff, order=c(3,1,3))
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(0,1,1)</span>
```{r}
my_df <- plot_arima_residuals(arima011, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(1,1,0)</span>
```{r}
my_df <- plot_arima_residuals(arima110, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(1,1,1)</span>
```{r}
my_df <- plot_arima_residuals(arima111, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(2,1,0)</span>
```{r}
my_df <- plot_arima_residuals(arima210, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(2,1,1)</span>
```{r}
my_df <- plot_arima_residuals(arima211, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(0,1,2)</span>
```{r}
my_df <- plot_arima_residuals(arima012, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(1,1,2)</span>
```{r}
my_df <- plot_arima_residuals(arima112, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA2,1,2)</span>
```{r}
my_df <- plot_arima_residuals(arima212, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(3,1,0)</span>
```{r}
my_df <- plot_arima_residuals(arima310, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(3,1,1)</span>
```{r}
my_df <- plot_arima_residuals(arima311, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(3,1,2)</span>
```{r}
my_df <- plot_arima_residuals(arima312, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(0,1,3)</span>
```{r}
my_df <- plot_arima_residuals(arima013, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(1,1,3)</span>
```{r}
my_df <- plot_arima_residuals(arima113, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(2,1,3)</span>
```{r}
my_df <- plot_arima_residuals(arima213, my_df)
```
<span style="font-size: 18px;font-weight: bold;">ARIMA(3,1,3)</span>
```{r}
my_df <- plot_arima_residuals(arima313, my_df)
```
<span style="font-size: 18px;font-weight: bold;">BIC Values of the models</span>
```{r}
BIC_values <- c(BIC(arima011), BIC(arima110), BIC(arima111),
BIC(arima210), BIC(arima211), BIC(arima012), BIC(arima112), BIC(arima212),
BIC(arima310), BIC(arima311), BIC(arima312), BIC(arima013), BIC(arima113), BIC(arima213), BIC(arima313))
BIC_values
```
<span style="font-size: 18px;font-weight: bold;">Lowest BIC Value</span>
```{r}
min_BIC <- min(BIC_values)
min_BIC
```
```{r}
min_BIC_index <- which.min(BIC_values)
min_BIC_index
```
<span style="font-size: 18px;">The lowest BIC values is from the model ARIMA(0,1,1). Therefore, we continue with the residual analysis to determine if the model has captured all the dependencies.</span>
<span style="font-size: 18px;font-weight: bold;">Residual Analysis for ARIMA(0,1,1)</span>
```{r}
residuals <- residuals(arima011)
acf(residuals, main = 'ACF of Residuals')
pacf(residuals, main = 'ACF of Residuals')
```
<span style="font-size: 18px;font-weight: bold;">Ljung's Box Test</span>
```{r}
tsdiag(arima011)
```
<span style="font-size: 18px;">From the above plot, we can see that most of the p-values are above the confidence interval, so lets try and forecast for the next 60 days.</span>
```{r}
predicted_values <- predict(arima011, n.ahead = 60)
original_forecasts <- diffinv(predicted_values$pred, lag = 1, differences = 1, xi = 39.84)
original_forecasts <- original_forecasts[original_forecasts != 0]
original_forecasts
```
```{r}
plot(test_data$Date, test_data$Close, main = 'Original Data vs Forecast for 2019', type = 'l', col = 'blue', xlab = 'Date', ylab = 'Employees')
lines(test_data$Date, original_forecasts, col = 'red', type = 'l')
legend('topleft', legend = c('Original Data', 'Forecasts'), col = c('blue', 'red'), lty = 1)
```
<span style="font-size: 18px;">As we can see from the forecast against the original data, we can say that the ARIMA model has not captured the dependencies. The reason this happens is because the variance of the original data is not constant. We go ahead and implement GARCH model to our data. </span>
<span style="font-size: 18px;font-weight: bold;">GARCH Model</span>
```{r}
return = diff(log(train_data$Close))^2
#Try with absolute and square of the returns
acf(return, lag.max = 100, main='ACF of return')
pacf(return, lag.max = 100, main='PACF of return')
```
```{r}
garch(x=return, grad='numerical', trace=FALSE)
garch_spec <- ugarchspec(variance.model = list(garchOrder = c(2,2)),
mean.model = list(armaOrder = c(2,2)),
distribution.model = 'std')
garch_fit <- ugarchfit(spec = garch_spec, data = return, solver = 'hybrid')
summary(garch_fit)
garch_fit
```
<span style="font-size: 18px;font-weight: bold;">Residual Analysis</span>
```{r}
resid_norm = residuals(garch_fit)
qqnorm(resid_norm, main='QQ plot of residuals from GARCH ')
qqline(resid_norm)
acf(resid_norm, lag.max = 60, main='ACF of residual')
pacf(resid_norm, lag.max = 60, main='PACF of residual')
```
<span style="font-size: 18px;font-weight: bold;">Forecasting and Simulating</span>
```{r}
sfinal <- garch_spec
setfixed(sfinal) <- as.list(coef(garch_fit))
coef(garch_fit)
f2019 <- ugarchforecast(data = return, fitORspec = sfinal, n.ahead = 60)
plot(sigma(f2019))
sim <- ugarchpath(spec = sfinal, m.sim = 2, n.sim = 1*60, rseed = 123)
p <- 39.84*apply(fitted(sim), 2, 'cumsum') + 39.84
matplot(p, type = "l", lwd = 3)
```
<span style="font-size: 18px;font-weight: bold;">gjrGARCH Model</span>
```{r}
garch_spec <- ugarchspec(variance.model = list(model = 'gjrGARCH'),
mean.model = list(armaOrder = c(2,2)),
distribution.model = 'std')
garch_fit <- ugarchfit(spec = garch_spec, data = return, solver = 'hybrid')
summary(garch_fit)
garch_fit
resid_norm = residuals(garch_fit)
qqnorm(resid_norm, main='QQ plot of residuals from gjr-GARCH ')
qqline(resid_norm)
acf(resid_norm, lag.max = 60, main='ACF of residual')
pacf(resid_norm, lag.max = 60, main='PACF of residual')
sfinal <- garch_spec
setfixed(sfinal) <- as.list(coef(garch_fit))
coef(garch_fit)
f2019 <- ugarchforecast(data = return, fitORspec = sfinal, n.ahead = 60)
sim <- ugarchpath(spec = sfinal, m.sim = 2, n.sim = 1*60, rseed = 123)
p <- 39.84*apply(fitted(sim), 2, 'cumsum') + 39.84
matplot(p, type = "l", lwd = 3)
```