-
Notifications
You must be signed in to change notification settings - Fork 0
/
regression-1-1.Rmd
executable file
·210 lines (139 loc) · 7.66 KB
/
regression-1-1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
---
title: "1 - Thought experiments"
author: "Dr James Tripp, CIM (Released under the GPL 3.0)"
output:
html_document:
df_print: paged
html_notebook: default
---
# Introduction
The topic of this workshop is Bivariate regression - a very simple mathematical model, a line. We will go from the basics of bivariate regression in R to model inference and assumption checking. In the first hour we cover the basic with made up data and in the second hour we progress to the real life data.
This file is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. By using this format you can be introduced to the code with annotations and run sets of the code yourself. All of the code can run by pressing the play button above each cell, or clicking on options under the run button. You can also copy and paste the code into another file or the console.
# Data
We need some data. We will use fictional data.
## Our fictional study
Let us imagine we are interested on the influence of drinks consumed in our local drinking establishment. In particular, how does the amount of substances in drinks influence arcade game performance? We choose to do 3 studies looking at the influence of different amounts of 3 substances: artificial sweetner, alcohol and caffeine.
Our methodology for each substance is to:
1. Prepare 10 drinks with 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100 of the substance mixed with water. (ml for alcohol and sweetner, mg for caffeine)
2. Choose 10 random people
3. For each person
i) After signing consent forms (keeping everything highly ethical!) give the person the drink to consume
ii) Let them play an arcade game
iii) Record their score
## Our data
Let us clear our R enviroment
```{r clean_up}
rm(list=ls())
```
and create a data frame for each of our data sets. Each data frame is given fake data for model fitting purposes.
```{r echo=TRUE, artifical_data}
df.sweetner <- data.frame(sweetner = seq(10,100,10), score = c(50, 45, 55, 58, 55, 51, 59, 60, 50, 55))
df.alcohol <- data.frame(alcohol = seq(10,100,10), score = c(50, 48, 50, 35, 28, 30, 20, 21, 20, 2))
df.caffeine <- data.frame(caffeine = seq(10,100,10), score = c(42, 55, 60, 70, 68, 75, 78, 60, 50, 30))
```
We have a predictor (the substance amount) and an outcome (arcade game score) in each data set. These data appear perfect to fit a bivariate regression - a straight line model with one predictor and one outcome.
James will show you how to do this with the alcohol data set. Your task at the end of the demonstration is to try and fit a model to either the sweetner or caffeine data sets.
What does our data set look like? One way to see this is to print out the values.
```{r echo=TRUE, print_artifical_data}
df.alcohol
```
What do you notice about the impact of alcohol on score? Is there a relationship between the predictor and the outcome?
Perhaps a plot will make this clearer?
```{r plot_artifical_data}
plot(df.alcohol)
```
# A Regression!
We can describe the relationship between alcohol and score with a straight line! Technically this line is a *linear model*. The equation for this line is:
$$
score_i = \alpha + \beta \cdot alcohol_i + \epsilon
$$
This equation may look daunting but stick with it. The equation is to predict the value of score using an intercept (alpha), some random error (epsilon) and the corresponding value of alcohol multiplied by a value we call beta. It is up to R to find the straight line which gets as close to all the points as possible.
## Fitting the regression
We use a function called lm (linear model) to find the line which goes through the data.
```{r regression_fit}
m1 <- lm(formula = score~alcohol, data = df.alcohol)
```
An easy way to see this model is on a plot.
```{r plot_regression_line}
plot(df.alcohol$alcohol, df.alcohol$score)
abline(m1)
```
R did a good job! The line goes through most of the data.
## A linear model?
But how does this relate to the equation you saw earlier? Has R figured out the value of the intercept (start of the line) and the coefficient for alcohol (value to multiply alcohol by)?
```{r print_m1}
print(m1)
```
Printing out m1 tells us the values R chose (the coefficients). The intercept is 57.6 and the coefficient for alcohol is -0.4945. Going back to the equation above, we can calculate a predicted score by multiplying a value of alcohol by -0.4945 and adding the intercept value.
Skeptical? I would be too. Let's check and see if the values match up.
```{r print_predicted}
df.alcohol$predicted <- 57.6 + (df.alcohol$alcohol * -0.4945)
print(df.alcohol)
```
That is pretty good. However, we still miss some points. For example, we miss the actual score when the dose of alcohol is 50ml by about 7. This error is the epsilon in the equation above. R found the straight line with the smallest errors. Well done R! We now know how to fit a bivariate regression model to data.
A small aside: we used a formula when fitting the linear model. Specifically, we told the lm function to create a linear model which predicts score using alccohol. Interestingly, this is a formula object. Note that we do not use apostrophes either side of the statement
```{r}
class(score ~ alcohol)
```
instead of
```{r}
class('score ~ alcohol')
```
## Under the hood
What is this m1 variable? Does it contain only the coefficient values or is there more to it? Let us check the class
```{r m1_class}
class(m1)
```
and structure of m1
```{r m1_str}
str(m1)
```
It seems there is a lot more in the m1 variable. Where should we start?
Well, there are the coefficients
```{r m1_coefficients}
m1$coefficients
```
and also the errors
```{r m1_residuals}
m1$residuals
```
which are the difference between the actual score and the predicted score
```{r calc_predicted}
df.alcohol$score - df.alcohol$predicted
```
and the original data
```{r m1_data}
m1$model
```
There is more to explore but, alas, we must press on.
We can do more things with a model object (variable) by passing it to a function. A common function used with linear models is summary. The summary function produced details about inferential statistics.
```{r summary}
summary(m1)
```
We will go into more detail about this output in the next file. Suffice to say, many people fit linear models explicitely to get this output and in our case indicates alcohol influences score.
## Making predictions
Another function used with linear models is predict. We can ask R what the model predicts for new data. Imagine we gave people even more alcohol.
```{r more_alcohol}
df.alcohol.more <- data.frame(alcohol = c(200, 250, 300))
```
We ask R to predict the score using the linear model already fit. In other words, take each new alcohol value and work out the predicted score.
$$
score_i = 57.6 + (-0.49455 \cdot alcohol_i)
$$
We do this using the predict function
```{r prediction?}
predict(m1, df.alcohol.more)
```
We can this add new data alonside the old.
```{r with_predictions}
plot(c(0,300), c(-100, 100), type='n', xlab = 'Alcohol (ml)', ylab = 'Score') # make blank plot
points(df.alcohol$alcohol, df.alcohol$score, col="blue", pch = 15)
points(df.alcohol.more$alcohol, predict(m1, df.alcohol.more), col="red", pch = 8)
abline(m1)
legend('bottomleft',
legend = c('Data', 'Predicted Data', 'Model'),
col = c('blue', 'red', 'black'),
pch = c(15, 8, 9),
text.col = 'Black')
```
Do you believe these predictions? If not, do you think we should be wary of model predictions? As a rule of thumb, models are often as good as the people who choose the terms and interpret the results. On that positive note, it is time for a break followed by the next document! Also, do try and fit a model to the other hypothetical data.