-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathhdat9600ch5_tbla.Rmd
176 lines (117 loc) · 7.85 KB
/
hdat9600ch5_tbla.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
---
title: "HDAT9600 Chapter 5 TBLA"
subtitle: "input details here"
author: "Victor Hui, Haran Nathan, Maathumai Ranjan, Meg Stevens"
date: "insert date of completeion here"
output: html_document
---
```{r setup, include=FALSE}
# leave this code here, but feel free to adjust the options or add some more
# see the knitr documentation for details
knitr::opts_chunk$set(echo = TRUE, fig.width=12, fig.height=12)
```
## Instructions
As a team, you should complete the tasks described below, in the spaces provided. Don't hesitate to ask each other and the course instructors for help (use the OpenLearning environment to do that) --- remember that these are learning activities and we learn most when we share.
## Task 1 (7 marks)
In this task, you will be using the `nodal` dataset which comes with the _boot_ package for R which you should already have installed --- it should have been installed automatically when you installed the package containing the HDAT9600 chapter 2 tutorial. If not, install it now using the Packages tab in RStudio or the `install.packages()` function at R console prompt.
The manual (help) page for the `nodal` dataset states the following:
> The nodal data frame has 53 rows and 7 columns.
> The treatment strategy for a patient diagnosed with cancer of the prostate depend highly on whether the cancer has spread to the surrounding lymph nodes. It is common to operate on the patient to get samples from the nodes which can then be analysed under a microscope but clearly it would be preferable if an accurate assessment of nodal involvement could be made without surgery.
> For a sample of 53 prostate cancer patients, a number of possible predictor variables were measured before surgery. The patients then had surgery to determine nodal involvement. It was required to see if nodal involvement could be accurately predicted from the predictor variables and which ones were most important.
The variables in the data set are all binary variables:
`m`
: A column of ones (this column is deleted in the set-up code below, so ignore it henceforth.)
`r`
: An indicator of nodal involvement.
`aged`
: The patients age dichotomized into less than 60 (0) and 60 or over 1.
`stage`
: A measurement of the size and position of the tumour observed by palpitation with the fingers via the rectum. A value of 1 indicates a more serious case of the cancer.
`grade`
: Another indicator of the seriousness of the cancer, this one is determined by a pathology reading of a biopsy taken by needle before surgery. A value of 1 indicates a more serious case of the cancer.
`xray`
: A third measure of the seriousness of the cancer taken from an X-ray reading. A value of 1 indicates a more serious case of the cancer.
`acid`
: The level of acid phosphatase in the blood serum.
```{r task-1-setup}
# this loads the nodal dataset and makes it available for code in
# subsequent code chunks.
data(nodal, package="boot")
nodal$m <- NULL
```
### 1.a
Carry out a brief exploratory data analysis, using the `summary()` function on the whole data frame, and also plotting it as a matrix image. Ensure that the columns in the data frame are plotted as columns in the matrix image. Hint: `image()`. Commentary is optional. (1 mark)
```{r task-1-a}
# insert your R code (with comment lines if you wish) here
summary(nodal)
head(nodal)
# function to rotate matrices, taken from Chapter 1 section_3.rmd
rotate <- function(x) t(apply(x, 2, rev))
image(rotate(matrix(c(nodal$r,nodal$aged),nrow=53,byrow=FALSE)),
xlab="LEFT = r - nodal involvement, RIGHT = age under 60 or over",
ylab="red is 1, yellow is 0")
# each band of the image represents one data row
image(rotate(matrix(c(nodal$r,nodal$stage),nrow=53,byrow=FALSE)),
xlab="LEFT = r - nodal involvement, RIGHT = more serious case of cancer by digital palpation stage",
ylab="red is 1, yellow is 0")
image(rotate(matrix(c(nodal$r,nodal$grade),nrow=53,byrow=FALSE)),
xlab="LEFT = r - nodal involvement, RIGHT = more serious case of cancer by biopsy result",
ylab="red is 1, yellow is 0")
image(rotate(matrix(c(nodal$r,nodal$xray),nrow=53,byrow=FALSE)),
xlab="LEFT = r - nodal involvement, RIGHT = more serious case of cancer by x-ray result",
ylab="red is 1, yellow is 0")
image(rotate(matrix(c(nodal$r,nodal$acid),nrow=53,byrow=FALSE)),
xlab="LEFT = r - nodal involvement, RIGHT = level of phosphatase in blood serum",
ylab="red is 1, yellow is 0")
# plotting whole data frame as an image
image(1:ncol(nodal), 1:nrow(nodal), rotate(as.matrix(nodal)), axes=FALSE, ylab="Observations (red is 1, yellow is 0)", xlab="")
# add axis labels, note: 3 <- above, 2 <- left
image(1:ncol(nodal), 1:nrow(nodal), rotate(as.matrix(nodal)), axes=FALSE, add=TRUE)
axis(3, at=1:ncol(nodal), labels=colnames(nodal))
```
_Replace this text with your optional commentary for task 1.a_
All data points are binary.<br>
r - nodal involvement is the outcome variable and the others would be predictors<br>
Matrix images - each band of the image represents one data row<br>
None of the predictors are obviously more predictive than the others - based on eyeball matching of colours<br>
Perhaps there are more individuals with 2 or more predictor variables with nodal involvement than without
### 1.b
Fit an appropriate regression model with nodal outcome (the `r` variable) as the outcome and all the other variables as predictors. Display the model summary and comment on it, in particular whether there is evidence that any of the predictors are related to the outcome. (2 marks)
```{r task-1-b}
# insert your R code (with comment lines if you wish) here
nodal_mod <- glm(r ~ aged + stage + grade + xray + acid, family=binomial, data=nodal)
summary(nodal_mod)
# str(nodal_mod)
# this gives the structure of the model object, but also prints a long list of stuff, hence commented out unless required
```
Xray and Acid predictors are statistically significant.
Age, Stage and Grade are not statistically significant.
Based on a statistically significant P value of p=0.05
### 1.c
Fit a smaller model that removes the two least significant variables in the model fitted in task 1.b. Can this smaller model be used in preference to the larger model? Justify your answer by creating an appropriate analysis of deviance. (2 marks)
```{r task-1-c}
# insert your R code (with comment lines if you wish) here
nodal_small <- glm(r ~ stage + xray + acid, family=binomial, data=nodal)
summary(nodal_small)
# Comparing the two models
print(anova(nodal_mod,nodal_small, test="Chi"))
```
The two least significant values are aged and grade<br>
When removed, the three predictors all become statistically significant P<0.05<br>
The analysis of deviance by ANOVA shows that removal of the two predictors (difference between the two models) is statistically significant, therefore we can conclude that aged and grade are not significant predictors of nodal `r`, if the other three variables are also in the model.
### 1.d
Using the smaller model from task 1.c, calculate how much does having a serious x-ray result increase the odds of nodal involvement compared to have a non-serious x-ray result? Provide a 95% confidence interval for the odds. (2 marks)
```{r task-1-d}
# insert your R code (with comment lines if you wish) here
# exponents of the betas can be interpreted directly as odds
beta <- nodal_small$coefficients
exp(beta)
# profile likelihood-based confidence interval using confint function
confint <- confint(nodal_small)
# xray is the third variable
xray_ci <- confint[3,]
exp(xray_ci)
```
The odds of nodal involvement increases by 5.8x in those with a serious X-ray result (compared to a non-serious X-ray result), according to this model. The 95% confidence interval is from a 1.6x increase in odds to 35.4x increase in odds with a serious compared to non-serious X-ray result.
### Save and knit
**Reminder**: don't forget to save this file, to knit it and check that everything works.