# Other nonparametric tests {#ch-other-nonpar-tests}
## Introduction {#sec:h17introduction}
In this chapter, we discuss different nonparametric tests.
These tests can be used when the data is not
measured on an interval level of measurement (see Chapter
\@ref(ch-levelsofmeasurement)), or if the probability distribution of the
data deviates from the normal distribution (see
§\@ref(sec:whatifnotnormal)). The nonparametric tests do not make
assumptions about the parameters of the probability distribution of the data.
Earlier, we already saw that nonparametric correlation coefficients exist,
namely Spearman's rank correlation coefficient (§\@ref(sec:Spearman))
and the (nominal) Phi correlation coefficient
(§\@ref(sec:Phi)). In the previous chapter, we discussed
a widely used nonparametric test, the $\chi^2$-test. Below, we will look at some other
frequently used nonparametric tests. We discuss these in two groups:
first for paired observations, and then for unpaired observations from
multiple samples. In each section, we first discuss the tests
which use information at the nominal level (sign tests and related) and
then the tests which use information at the ordinal level, i.e. which are based on the
rank order of the observed values.
## Paired observations, single sample
### Sign test {#sec:signtest}
A handy test for paired observations is the so-called sign test.
This test can be viewed as a nonparametric, nominal counterpart of
the *t*-test for paired observations
(§\@ref(sec:ttest-paired)).
In this test, we look only at the *sign* (positive or negative)
of the *difference* $D$ between the two paired observations. Let us again
take the example of an imaginary study on webpages with
*U* (Dutch formal 'you') and *je* (Dutch informal 'you') as forms of address, with
$N=10$ respondents. In Table \@ref(tab:data-uje-paired), we saw
that all 10 respondents preferred *je*: the difference variable $D$ was
$10\times$ negative and $0\times$ positive, or put differently, all the outcomes
of $D$ were negative.
With the sign test, we look at how probable this distribution of positive and
negative values of $D$ would be, if H0 were correct. According to
H0, we expect $N/2$ positive and $N/2$ negative differences; according to
H0, the probability of a positive sign of $D$ (the probability of a hit) is thus
$p=1/2$. We now determine the probability of the observed outcome (0 hits)
given H0, and we use the binomial probability distribution for this
(§\@ref(sec:binomial-distribution)):
\begin{equation}
(\#eq:prob-binom-uje)
P(0\,\mbox{hits}) = {10 \choose 0} (0.5)^0 (1-0.5)^{10-0} = (1) (1) (0.000976) < 0.001
\end{equation}
The probability of this outcome according to H0 is so small that, in light of this
observed (and presumably valid) outcome, we decide to reject H0, and we report this as
follows:
> The $N=10$ respondents unanimously give a lower judgement to the
> webpage with *U* as the form of address than to the comparable page
> with *je* as the form of address; this is a significant difference
> (sign test, $p<.001$).
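
The binomial computation in equation \@ref(eq:prob-binom-uje) can be checked with a few lines of code. The following is a minimal sketch in Python using only the standard library; the function name `binom_pmf` and the variable names are illustrative choices, not part of the original analysis.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k hits in n trials with hit probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 10    # number of respondents
hits = 0  # number of positive differences D

# Probability of exactly 0 hits if H0 (p = 1/2) is true
p_exact = binom_pmf(hits, n)
# Probability of this many hits or fewer (identical here, because hits = 0)
p_lower = sum(binom_pmf(k, n) for k in range(hits + 1))

print(round(p_exact, 6))  # 0.000977
```

The exact value is $1/1024$, which is indeed smaller than .001.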
### Wilcoxon signed-ranks test {#sec:Wilcoxon-signed-rank}
The Wilcoxon signed-ranks test can be viewed as a
nonparametric, ordinal counterpart of the *t*-test for paired
observations (§\@ref(sec:ttest-paired)).
This test makes use of the *rank order* of the difference $D$ between
the two paired observations. We will again use the example of the imaginary
study on webpages with *U* or *je* as forms of
address
(Table \@ref(tab:data-uje-paired)), but will now look at the *rank order*
of the differences $D$ (taking into account equal differences from several
participants), and indicate the sign (positive or
negative) of the difference $D$:
$$-2, -2, -7.5, -5, -7.5, -5, -10, -7.5, -5, -2$$
The sum of the positive rankings is $W_+=0$ (there are no positive rankings)
and the sum of the negative rankings $W_-= -53.5$, and with that $|W_-|=53.5$.
The smaller of
these two sums ($W_+$ or $|W_-|$) forms the test statistic; here, that is
$W_+=0$.
We will not discuss the probability distribution of the test statistic
but instead have the significance calculated by computer: $p=.006$. The
probability of this outcome according to H0 is so small that, in light of this
observed (and presumably valid)
outcome, we again decide to reject H0.
The (ordinal) Wilcoxon signed-ranks test makes use of more information
than the (nominal) sign test. If an effect is significant according to the sign
test, as is the case in this example, then it is usually (though not necessarily)
also significant according to the Wilcoxon signed-ranks test. Likewise, an effect
that is significant in the Wilcoxon signed-ranks test is usually also significant
according to the *t*-test, provided that the assumptions of the *t*-test are met.
This has to do with the level of measurement: the sign test
considers only the (nominal) *sign* of the differences, the
Wilcoxon signed-ranks test is based on the (ordinal) *ranking* of the
differences, and the *t*-test is based on the (interval) *size* of the
differences.
#### formulas
We not only calculate $W_+$ (or $|W_-|$) in the aforementioned manner, but also
the corresponding value of $z$ [@Ferg89]:
\begin{equation}
(\#eq:Wilcoxon-signedrank-z)
z = \frac{ W_+ - \frac{N(N+1)}{4} } { \sqrt{ \frac{N(N+1)(2N+1)}{24} } }
\end{equation}
With this, we can calculate the effect size, in the form of
a correlation [@Rose91 Eq.2.18]:
\begin{equation}
(\#eq:es-z2r)
r = \frac {z} {\sqrt{N}}
\end{equation}
For the example above, we find
$z=-2.80$, and $r=-.89$, which indicates an extremely large effect.
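
As a check on equations \@ref(eq:Wilcoxon-signedrank-z) and \@ref(eq:es-z2r), the sketch below recomputes $z$ and $r$ from $W_+=0$ and $N=10$. It is written in Python with the standard library only; the function name `signed_rank_z` is a hypothetical helper introduced for this illustration.

```python
from math import sqrt

def signed_rank_z(w_plus, n):
    """Normal approximation for the Wilcoxon signed-ranks statistic W+."""
    mean_w = n * (n + 1) / 4                      # expected W+ under H0
    sd_w = sqrt(n * (n + 1) * (2 * n + 1) / 24)   # standard deviation under H0
    return (w_plus - mean_w) / sd_w

n = 10      # number of paired observations
w_plus = 0  # sum of the positive rankings in the example

z = signed_rank_z(w_plus, n)
r = z / sqrt(n)  # effect size as a correlation

print(round(z, 2), round(r, 2))  # -2.8 -0.89
```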
## Independent observations, multiple samples
### Median test
The median test can be viewed as a nonparametric, nominal
counterpart
of the *t*-test for unpaired, independent observations. It is actually
a sign test (see §\@ref(sec:signtest)), in which we test whether the
distribution of observations above/below their *joint* median
(see §\@ref(sec:median) for explanation about the median) deviates from
the expected distribution according to H0.
The null hypothesis H0 is that the distributions of the two samples
do not differ from each other, and that approximately half of the observations
in both samples lie above the joint median, and the other half lies below
it.
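
The first step of the median test, counting observations above the joint median per sample, might be sketched as follows. The sketch is in Python (standard library only) and borrows the judgements from Table \@ref(tab:data-originalrewritten) in the next section purely as illustration. Counting values equal to the joint median as "not above" is an assumption made here; other conventions exist. The helper name `split_at` is hypothetical.

```python
from statistics import median

# Judgements copied from the table of original and rewritten texts
original = [10, 17, 35, 2, 19, 4, 18, 28, 24]
rewritten = [15, 22, 8, 48, 29, 25, 27, 39, 31, 36]

# Median of both samples taken together
joint_median = median(original + rewritten)

def split_at(values, cutoff):
    """Return (number above cutoff, number at or below cutoff)."""
    above = sum(1 for v in values if v > cutoff)
    return above, len(values) - above

table = {
    "original": split_at(original, joint_median),
    "rewritten": split_at(rewritten, joint_median),
}
print(joint_median, table)
# 24 {'original': (2, 7), 'rewritten': (7, 3)}
```

The resulting $2\times2$ table of counts is what the median test then compares with the even split expected under H0.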
### Wilcoxon rank sum test, or Mann-Whitney U test {#sec:wilcoxon-rank-sum}
The Wilcoxon rank sum test is equivalent to the Mann-Whitney U test.
Both can be viewed as nonparametric, ordinal counterparts
of the *t*-test for unpaired, independent observations
(§\@ref(sec:ttest-indep)).
Let us say that we want to investigate whether certain text attributes
have an influence on the subjective appreciation of the text. For this,
a researcher selects a random sample of participants
from the population (see
§\@ref(sec:random-samples)), and assigns these participants in a random
manner to two experimental conditions (randomisation, see
§\@ref(sec:internalvalidity), point 5).\
In the first condition, the participants give a judgement about
the original version of a text. In the second condition, the participants
give a judgement about the rewritten version of the same text.
The higher the given score, the higher the valuation for the text.
One of the participants unfortunately had to leave the study
prematurely. The judgements of the remaining 19 participants
are in Table \@ref(tab:data-originalrewritten). On the basis of the random
sample and the random assignment of participants to conditions,
the judgements can be seen as coming from two different
random samples. The null hypothesis is that there is no difference
in valuation between the two conditions.

Table: (#tab:data-originalrewritten) Judgements of $N=19$ participants on
the original and rewritten versions of a text.

  Condition
  ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  Original        10   17   35    2   19    4   18   28   24   --
  Rewritten       15   22    8   48   29   25   27   39   31   36

The Wilcoxon rank sum test is based on the *ranking* of the
observations. Each observation is replaced by the ranking of
that observation, taken over the two conditions together. The lowest or
smallest value gets ranking 1. We indicate the sum of the rankings of the
smaller group (here: the original condition, with $n_1=9$ observations,
against $n_2=10$ in the rewritten condition) with
$W_1$. The probability distribution of $W_1$ under H0 is known (exactly for small
$n_1$ and $n_2$, and approximately for larger samples). With this, we can
determine the probability of encountering the observed value of $W_1$,
or a more extreme value, if H0 is true.
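
The ranking step described above can be sketched in Python (standard library only) for the data in Table \@ref(tab:data-originalrewritten). The helper `midranks` is a hypothetical name; it assigns tied values the mean of their positions, although this data set happens to contain no ties.

```python
original = [10, 17, 35, 2, 19, 4, 18, 28, 24]
rewritten = [15, 22, 8, 48, 29, 25, 27, 39, 31, 36]

def midranks(values):
    """Map each distinct value to its rank (1 = smallest); ties get the mean rank."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        ranks[ordered[i]] = (i + 1 + j) / 2  # mean of positions i+1 .. j
        i = j
    return ranks

# Rank all observations over the two conditions together
rank_of = midranks(original + rewritten)

w1 = sum(rank_of[x] for x in original)   # rank sum of the smaller group
w2 = sum(rank_of[x] for x in rewritten)  # rank sum of the other group

print(w1, w2)  # 67.0 123.0
```

The two rank sums add up to $19 \times 20 / 2 = 190$, the sum of the rankings 1 to 19.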
Earlier, we saw that the *t*-test for unpaired observations
(§\@ref(sec:ttest-indep)) investigates whether the *means*
are different for two samples. Analogously, the Wilcoxon
rank sum test (and the Mann-Whitney $U$ test) investigates whether the
*medians* are different for the two samples. The test is thus more
robust against outliers: if we were to replace the highest judgement (48)
with a much higher judgement (say 480), then that would have no influence
on the median of that group, nor on the test statistic or its
significance.
For our example, we find that the lower rankings occur relatively frequently
in the first condition (original version), i.e. that the text in this condition
received lower judgements. The sum of the rankings for this
smaller group is the test statistic $W_1=67$. In some versions of the
test[^fn17-1], this raw sum is used to calculate the significance.
In other test versions[^fn17-2], this raw sum is firstly corrected for the
minimal value of $W_1$ (see the formulas below): the test statistic
is then $U=W_1 - \textrm{min}(W_1) = 67-45=22$. Afterwards, the significance
of $W_1=67$ or of $U=22$ is calculated. We find that $p=.07$. If we do a
two-sided test (H0: judgements in condition 2 are no higher and no lower
than those in condition 1) with $\alpha=.05$, then there is no reason to
reject H0[^fn17-3].
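
For small samples without ties, the exact significance can be obtained by counting how many of the possible assignments of rankings to the smaller group yield a rank sum at least as extreme as the observed one. The following Python sketch (standard library only) does this for $n_1=9$, $n_2=10$ and $W_1=67$. It is one possible way to carry out the exact computation, not necessarily the algorithm used by SPSS or R, and the function name `rank_sum_counts` is hypothetical.

```python
from math import comb

def rank_sum_counts(n1, n2):
    """counts[s] = number of ways to pick n1 of the ranks 1..n1+n2 with sum s."""
    n = n1 + n2
    max_sum = n1 * (n1 + 2 * n2 + 1) // 2  # all highest ranks in group 1
    # ways[k][s]: number of subsets of size k with rank sum s
    ways = [[0] * (max_sum + 1) for _ in range(n1 + 1)]
    ways[0][0] = 1
    for rank in range(1, n + 1):
        # go down in k so that each rank is used at most once per subset
        for k in range(min(rank, n1), 0, -1):
            for s in range(max_sum, rank - 1, -1):
                ways[k][s] += ways[k - 1][s - rank]
    return ways[n1]

n1, n2 = 9, 10
w1 = 67

u = w1 - n1 * (n1 + 1) // 2        # correction for the minimal value of W1
mean_w1 = n1 * (n1 + n2 + 1) / 2   # expected W1 under H0
counts = rank_sum_counts(n1, n2)
total = comb(n1 + n2, n1)          # number of equally likely assignments

# Two-sided: rank sums at least as far from the expected value as observed
distance = abs(w1 - mean_w1)
extreme = sum(c for s, c in enumerate(counts) if abs(s - mean_w1) >= distance)
p_two_sided = extreme / total

print(u, round(p_two_sided, 4))
```

The resulting two-sided probability should agree with the value of $p=.07$ reported above.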
#### formulas
For the sums of the rankings, it is the case that
$W_1 + W_2 = (n_1+n_2) (n_1+n_2+1) / 2$.
If all the lowest rankings (i.e. all the lowest judgements) are in the smaller (first) group,
then $W_1$ has the minimum value of
$n_1 (n_1+1) /2$.
If all the highest rankings (i.e. all the highest judgements) are in this group,
then $W_1$ has the maximum value of
$n_1 (n_1+2n_2+1) / 2$.
In the absence of ties, $W_1$ (and its minimum and maximum) can only be integer numbers.
It is useful to not only calculate $W_1$ or $U$, but also the
corresponding value of $z$ [@Ferg89]:
\begin{equation}
(\#eq:Wilcoxon-ranksum)
\bar{W_1} = \frac{ n_1 (n_1+n_2+1) }{ 2 }
\end{equation}
\begin{equation}
(\#eq:Wilcoxon-ranksum-z)
z = \frac{ |W_1-\bar{W_1}|-\frac{1}{2} }{ \sqrt{ \frac{n_1 n_2 (n_1+n_2+1)}{12} } }
\end{equation}
With this, we again determine the effect size, using equation \@ref(eq:es-z2r).
For the above example, we find
$\bar{W_1}=90$, $|W_1-\bar{W_1}|-\frac{1}{2}=22.5$, $z=1.84$, and $r=.42$, which indicates a
'medium' effect. That this considerable effect still does
not lead to a significant difference (with two-sided testing) is presumably
a consequence of the (too) small size of the two groups.
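
The arithmetic behind equations \@ref(eq:Wilcoxon-ranksum) and \@ref(eq:Wilcoxon-ranksum-z) can be retraced with the short Python sketch below (standard library only; variable names are illustrative). With $n_1=9$, $n_2=10$ and $W_1=67$, the expected rank sum evaluates to 90 and the numerator of $z$ to 22.5.

```python
from math import sqrt

n1, n2 = 9, 10  # sizes of the original and rewritten groups
w1 = 67         # observed rank sum of the smaller group

mean_w1 = n1 * (n1 + n2 + 1) / 2            # expected W1 under H0
sd_w1 = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # standard deviation under H0

numerator = abs(w1 - mean_w1) - 0.5         # with continuity correction
z = numerator / sd_w1
r = z / sqrt(n1 + n2)                       # effect size as a correlation

print(mean_w1, numerator, round(z, 2), round(r, 2))  # 90.0 22.5 1.84 0.42
```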
### Kruskal-Wallis H test
The Kruskal-Wallis H test can be viewed as an extension of the
Wilcoxon rank sum test (see
§\@ref(sec:wilcoxon-rank-sum) above), for $k \ge 2$ independent samples
or groups. The test can also be used to compare $k=2$ groups;
in this case, the test is completely equivalent to the Wilcoxon rank sum
test above. The Kruskal-Wallis H test can be viewed as the
nonparametric, ordinal counterpart of a one-way analysis of variance
(see §\@ref(sec:anova-oneway-explanation)). Put loosely: we carry out a kind of
analysis of variance, not on the observations themselves but on the rankings of the
observations. We calculate $H$ as the test statistic based on the rankings
of the observations in the $k$ different
groups.
#### formula
\begin{equation}
(\#eq:kruskall-wallis-H)
H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R^2_j}{n_j} - 3(N+1)
\end{equation}
where $R_j$ refers to the *sum* of the rankings of the observations
in group $j$, $n_j$ refers to the size of group $j$, and
$N=\sum_j n_j$ is the total number of observations.
(For convenience, we disregard 'ties',
which are instances in which the same value and ranking occur in
multiple groups.)
The test statistic $H$ has a probability distribution which resembles that of
$\chi^2$, with $k-1$ degrees of freedom. The significance of the
test statistic $H$ is thus determined via the probability distribution of
$\chi^2$ (see Appendix \@ref(app-criticalchi2values)).
This approximation via $\chi^2$ however only works if $k\ge3$
and $n_j\ge5$ for the smallest group [@Ferg89].
If $k=2$ or $n_j<5$ then the probability $P(H)$ is calculated exactly.
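
To illustrate equation \@ref(eq:kruskall-wallis-H), the Python sketch below (standard library only) computes $H$ for the two groups of Table \@ref(tab:data-originalrewritten). This chapter gives no data set with three or more groups, so the two-group case serves only to show the arithmetic. The function name `kruskal_wallis_h` is hypothetical, and the function assumes that there are no ties.

```python
def kruskal_wallis_h(groups):
    """H computed from the rank sums per group, assuming no tied values."""
    pooled = sorted(x for g in groups for x in g)
    rank_of = {x: i + 1 for i, x in enumerate(pooled)}  # valid only without ties
    n = len(pooled)                                     # total number of observations
    # sum over groups of (squared rank sum) / (group size)
    total = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

original = [10, 17, 35, 2, 19, 4, 18, 28, 24]
rewritten = [15, 22, 8, 48, 29, 25, 27, 39, 31, 36]

h = kruskal_wallis_h([original, rewritten])
print(round(h, 3))  # 3.527
```

For $k=2$, this value equals the square of the $z$ value of the Wilcoxon rank sum test computed without the continuity correction: $(67-90)^2/150 \approx 3.527$.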
[^fn17-1]: Wilcoxon rank sum test in SPSS.
[^fn17-2]: Mann-Whitney test in SPSS and in R, and Wilcoxon rank sum test in R.
[^fn17-3]: If we do a two-sided test with $\alpha=.10$, then we could indeed
reject H0. If we do a one-sided test (H0: judgements in condition 2 are not higher
than in condition 1), then we may halve the calculated $p$, since the calculated
$p$ assumes two-sided testing. We would then find $p=.0653/2=.0326$, and,
as this probability is smaller than $\alpha=.05$, we would then indeed
be able to reject H0.