-
Notifications
You must be signed in to change notification settings - Fork 1
/
Exampl_t_ANOVA.Rmd
148 lines (123 loc) · 6.63 KB
/
Exampl_t_ANOVA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
---
title: "t-Test and Anova"
author: "F.A. Barrios<br><small>Instituto de Neurobiología UNAM<br></small>"
date: "<small>`r Sys.Date()`</small>"
output:
rmdformats::robobook:
highlight: kate
---
```{r setup, include=FALSE}
library(knitr)
library(rmdformats)
## Global options
options(max.print="80")
opts_chunk$set(echo=TRUE,
cache=TRUE,
prompt=FALSE,
tidy=FALSE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=80)
```
# Examples for t-test and ANOVA
Examples of problems taken from chapter 3 (Vittinghoff's book http://www.biostat.ucsf.edu/vgsm) using data from the heart and estrogen/progestin study (HERS), a clinical trial of hormone therapy (HT) for prevention of recurrent heart attacks and death among 2,763 post-menopausal women with existing coronary heart disease (CHD). Also we will present some exercises from Daniel's book too, and this will cover a refresher for t-test and ANOVA (assuming parametric distributions).
## Introduction
We will cover some basic tools for assessing the statistical significance of differences between the average values of a continuous outcome across two or more samples following parametric statistics are the t-Test and ANOVA.
The example presented in (chapter 3 Table 3) of the t-Test of differences in glucose average combined with exercise for the women that are not diabetic in the HERS data set. These examples are to revisit some t-test R estimations and t-test function in R. And to remember that a **boxplot** gives a good amount of information about a numerical variable:
**Centering** described by the median.
**Dispersion**, measured by the high of the box (*interquartil* distance)
**Observation range** The presence of extreme values (*outliers*)
And some information of the distribution “linked to skewness”.
If the median (the bold lilne) is located toward the bottom of the box, then the data are *right-skewed* toward larger values. That is, the distance between the median and the 75th percentile is greater than that between the median and the 25th percentile. Likewise, right-skewness will be indicated if the upper whisker is longer than the lower whisker or if there are more outliers in the upper range.
```{r}
# To load the CRAN libraries used
library(multcomp)
library(tidyverse)
library(car)
library(emmeans)
```
```{r, echo= TRUE}
HERS <- read_csv("https://raw.githubusercontent.com/fabarrios/Regression/master/DataRegressBook/Chap3/hersdata.csv", show_col_types = FALSE)
# Loading the HERS database in HERS variable
names(HERS)
summary(HERS)
boxplot(glucose ~ diabetes, data=HERS)
var.test(glucose[HERS$diabetes == "no"] ~ exercise[HERS$diabetes == "no"],
data=HERS,
alternative="two.sided")
# Some graphical description of the glucose state
boxplot(glucose[HERS$diabetes == "no"] ~ exercise[HERS$diabetes == "no"], data=HERS)
t.test(glucose[HERS$diabetes == "no"] ~ exercise[HERS$diabetes == "no"],
data=HERS, alternative="two.sided", mu=0,
paired=FALSE,
var.equal=TRUE)
```
## t-test examles
Examples of t-test from the exercises of the Daniel's book, Chap 7, section 3 number 3. Data can be downloaded from the WEB page of the Daniel's book.
Hoekema et al. studied the craniofacial morphology of 26 male patients with obstructive sleep apnea syndrome (OSAS) and 37 healthy male subjects (non-OSAS). One of the variables of interest was the length from the most superoanterior point of the body of the hyoid bone to the Frankfort horizontal (measured in millimeters).
```{r, echo= TRUE}
# Daniel's chap 7 t-test examples
# EXR_C07_S03_03
Ex733 = read_csv(file="https://raw.githubusercontent.com/fabarrios/Regression/master/DataSets/ch07_all/EXR_C07_S03_03.csv", show_col_types = FALSE)
names(Ex733)
summary(Ex733)
# In parts
NoOSAS = Ex733$Length[Ex733$Group == 1]
OSAS = Ex733$Length[Ex733$Group == 2]
#
boxplot(NoOSAS, OSAS)
var.test(OSAS, NoOSAS)
t.test(NoOSAS, OSAS, alternative="less", conf.level = 0.99)
#
# Shorter and direct
boxplot(Length ~ Group, data = Ex733)
t.test(Length ~ Group, data = Ex733, alternative = "less", conf.level = 0.99)
```
How it looks for several variables an ANOVA example using the HERS data reffering to the diabetic participants.
```{r}
# Example of the HERS data for diabetic participants
# hers_yesdi <- filter(HERS, diabetes == "yes")
hers_yesdi <- HERS %>%
filter(diabetes == "yes") %>%
mutate(physact = factor(physact, levels=c("much less active","somewhat less active","about as active","somewhat more active","much more active")))
# Example of ANOVA with HERS data for diabetic participants
#
ggplot(data = hers_yesdi, mapping = aes(x = physact, y = glucose)) +
geom_boxplot(na.rm = TRUE)
glucose_yesdi_activ <- aov(glucose ~ physact, data = hers_yesdi)
glucose_yesdi_activ
# Or print the anova table using the car package function
Anova(glucose_yesdi_activ, type="II")
#
S(glucose_yesdi_activ)
glucose_emmeans <- emmeans(glucose_yesdi_activ, "physact")
contrast(glucose_emmeans, adjust="sidak")
```
# Example from R-bloggers
Plotting data distributions of random data to do an example of ggplot color filling. First we build four random variables with two different distributions. And then followed of a ggplot command using the color to distinguish the variables.
```{r}
# Create the four groups
set.seed(10)
df1 <- data.frame(Var="a", Value=rnorm(100,10,5))
df2 <- data.frame(Var="b", Value=rnorm(100,10,5))
df3 <- data.frame(Var="c", Value=rnorm(100,11,6))
df4 <- data.frame(Var="d", Value=rnorm(100,11,6))
# merge them in one data frame
df<-rbind(df1,df2,df3,df4)
# convert Var to a factor
df$Var <- as.factor(df$Var)
df %>% ggplot(aes(x=Value, fill=Var)) + geom_density(alpha=0.5)
```
## The ANOVA (taken from R-bloggers)
ANOVA (ANalysis Of VAriance) is a statistical test used to compare two or more groups to see if they are significantly different. The ANOVA model and some examples. The null hypothesis in ANOVA is that there is no difference between means and the alternative is that the means are not all equal. This means that when we are dealing with many groups, we cannot compare them pairwise. We can simply answer if the means between groups can be considered as equal or not.
```{r}
# ANOVA
model1 <- aov(Value ~ Var, data=df)
anova(model1)
```
## Tukey multiple comparisons
What about if we want to compare all the groups pairwise? In this case, we can apply the Tukey’s HSD which is a single-step multiple comparison procedure and statistical test, Tukey's Honest Significant Difference (Tukey's HSD). It can be used to find means that are significantly different from each other.
```{r}
summary(glht(model1, mcp(Var="Tukey")))
```