---
title: "Lesson 12: Inference for the Mean of Differences (Two Dependent Samples)"
output:
  html_document:
    theme: cerulean
    toc: true
    toc_float: false
---
<script type="text/javascript">
function showhide(id) {
var e = document.getElementById(id);
e.style.display = (e.style.display == 'block') ? 'none' : 'block';
}
</script>
<div style="width:50%;float:right;">
#### Optional Videos for this Lesson {.tabset .tabset-pills}
##### Part 1
<iframe id="kaltura_player_1645637245" src="https://cdnapisec.kaltura.com/p/1157612/sp/115761200/embedIframeJs/uiconf_id/47306393/partner_id/1157612?iframeembed=true&playerId=kaltura_player_1645637245&entry_id=1_a8mu9enz" width="480" height="270" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" frameborder="0"></iframe>
##### Part 2
<iframe id="kaltura_player_1645637296" src="https://cdnapisec.kaltura.com/p/1157612/sp/115761200/embedIframeJs/uiconf_id/47306393/partner_id/1157612?iframeembed=true&playerId=kaltura_player_1645637296&entry_id=1_bfblvjiv" width="480" height="270" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" frameborder="0"></iframe>
##### Part 3
<iframe id="kaltura_player_1645637532" src="https://cdnapisec.kaltura.com/p/1157612/sp/115761200/embedIframeJs/uiconf_id/47306393/partner_id/1157612?iframeembed=true&playerId=kaltura_player_1645637532&entry_id=1_ghir5682" width="480" height="270" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" frameborder="0"></iframe>
####
</div><div style="clear:both;"></div>
## Lesson Outcomes
By the end of this lesson, you should be able to do the following:
1. Recognize when a mean of differences (two dependent samples) inferential procedure is appropriate
2. Create numerical and graphical summaries of the data
3. Perform a hypothesis test for the mean of differences (two dependent samples) using the following steps:
    a. State the null and alternative hypotheses
    b. Calculate the test-statistic, degrees of freedom and P-value of the test using software
    c. Assess statistical significance in order to state the appropriate conclusion for the hypothesis test
    d. Check the requirements for the hypothesis test
4. Create a confidence interval for the mean of differences (two dependent samples) using the following steps:
    a. Calculate a confidence interval using software
    b. Interpret the confidence interval
    c. Check the requirements of the confidence interval
<br>
## Example of Paired Data: Pre- and Post-test Scores
In education, it is very common for researchers to conduct studies in which they administer a pre-test, provide some instruction, and then give a post-test. The difference between the post- and pre-test scores is a measure of the student's progress. In this case, it would not make much sense to only look at the mean score on the pre-test and compare it to the mean score on the post-test.
This is called a **matched-pairs** design, or we say we have **dependent samples**. Matched-pairs (or **paired-data**) designs typically involve only one population, and a pair of observations is drawn on each individual selected for the sample. In the context of the educational study, the two observations are a student's scores on (1) the pre-test and (2) the post-test. If a student is selected to participate in the pre-test (i.e., they are selected to be part of group 1), they are automatically selected to participate in the post-test (i.e., they are automatically in group 2).
There is a lot of merit in subtracting the individual scores and looking at the mean *gain*.
The researchers are not really interested in the students' knowledge before the instruction. This is used as a baseline to measure how much was gained during the instruction. There is great value in looking at the difference. This removes the effect of the individual students' ability, and it measures their learning during the unit.
To analyze the data, the researchers first find the difference in the post- and pre-test scores. At that point, the data have been reduced to a list of numbers (representing the increase in scores). Now, the researchers can conduct inference on the mean of these values. In other words, they can do a hypothesis test for the mean of the difference in the post- and pre-test scores.
A hypothesis test for two means with paired data (dependent samples) is conducted in the same way as a hypothesis test for a single mean with $\sigma$ unknown. The only exception is that the pairs of data must be subtracted before you start any computations. From a practical perspective, after you subtract, you apply the one-sample procedures you have already learned. So, there is nothing new that you need to learn to carry out a hypothesis test or compute a confidence interval for two means with paired data, except that we will use a different sheet in the Math221 Statistics Toolbox, one that automatically calculates the differences.
We will first explore an application of pre- and post-testing in a weight loss study.
## Hypothesis Tests
<img src="./Images/StepsAll.png">
### Mahon's Weight Loss Study
**Background**
Annie Mahon and other researchers in Wayne Campbell's nutrition lab studied the weight loss of $n=27$ middle aged women who consumed a prescribed low-calorie diet. <!--<cite>Mahon07</cite>--> The women's weights were recorded (in kilograms) at the beginning of the study and after the nine-week diet period. The data are given in the file [Mahon.xlsx](./Data/Mahon.xlsx). An excerpt of the data is given below.
<table>
<thead>
<tr class="header">
<th><p>Subject</p></th>
<th><p>Pre</p></th>
<th><p>Post</p></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><p>1</p></td>
<td><p>62.5</p></td>
<td><p>56.1</p></td>
</tr>
<tr class="even">
<td><p>2</p></td>
<td><p>88.8</p></td>
<td><p>80.2</p></td>
</tr>
<tr class="odd">
<td><p>3</p></td>
<td><p>74.7</p></td>
<td><p>70.8</p></td>
</tr>
<tr class="even">
<td><p>$\vdots$</p></td>
<td><p>$\vdots$</p></td>
<td><p>$\vdots$</p></td>
</tr>
<tr class="odd">
<td><p>26</p></td>
<td><p>76.3</p></td>
<td><p>73.8</p></td>
</tr>
<tr class="even">
<td><p>27</p></td>
<td><p>82.1</p></td>
<td><p>77.9</p></td>
</tr>
<tr class="odd">
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
<!--
{| class="basic"
! Subject !! Pre !! Post
|-
|1 || 62.5 || 56.1
|-
|2 || 88.8 || 80.2
|-
|3 || 74.7 || 70.8
|-
|$\vdots$ || $\vdots$ || $\vdots$
|-
|26 || 76.3 || 73.8
|-
|27 || 82.1 || 77.9
|-
|}
-->
Notice the structure of the data. The weight of each subject was measured before the study and at the conclusion of the study. Each person provided a pre-study weight and a post-study weight. Stated differently, the pre-study weights and the post-study weights are paired. For each row of data, both of these numbers came from the same person. When we collect two observations of the same measurement on each subject, we call it **paired data**. Sometimes paired data are called **dependent samples**.
<div class="QuestionsHeading">Answer the following question:</div>
<div class="Questions">
1. The researchers measured the initial weights of the women prior to the study, even though they were not particularly interested in this value. What was the purpose of measuring the pre-study weights?
<a href="javascript:showhide('Q1')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q1" style="display:none;">
* The goal of the study is to determine how much the women's weights change as a result of the study. The researchers must measure the women's weights at the beginning of the study, so they can subtract the initial (pre-study) weight of each woman from her final (post-study) weight.
</div>
</div>
<br>
**Computing New Variables in Excel**
The researchers are not interested in the weights of the women themselves; they are more interested in the *change* in the women's weights. This will give them a measure of the effectiveness of the low-calorie diet. In other words, they are interested in the difference of the weights after the study compared with before:
$$\text{Difference} = \text{Post} - \text{Pre}$$
We can calculate the difference for each woman in the study:
<table>
<thead>
<tr class="header">
<th><p>Subject</p></th>
<th><p>Post</p></th>
<th><p>Pre</p></th>
<th><p>Difference</p></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><p>1</p></td>
<td><p>56.1</p></td>
<td><p>62.5</p></td>
<td><p>56.1 $-$ 62.5 = -6.4</p></td>
</tr>
<tr class="even">
<td><p>2</p></td>
<td><p>80.2</p></td>
<td><p>88.8</p></td>
<td><p>80.2 $-$ 88.8 = -8.6</p></td>
</tr>
<tr class="odd">
<td><p>3</p></td>
<td><p>70.8</p></td>
<td><p>74.7</p></td>
<td><p>70.8 $-$ 74.7 = -3.9</p></td>
</tr>
<tr class="even">
<td><p>$\vdots$</p></td>
<td><p>$\vdots$</p></td>
<td><p>$\vdots$</p></td>
<td><p>$\vdots$</p></td>
</tr>
<tr class="odd">
<td><p>26</p></td>
<td><p>73.8</p></td>
<td><p>76.3</p></td>
<td><p>73.8 $-$ 76.3 = -2.5</p></td>
</tr>
<tr class="even">
<td><p>27</p></td>
<td><p>77.9</p></td>
<td><p>82.1</p></td>
<td><p>77.9 $-$ 82.1 = -4.2</p></td>
</tr>
</tbody>
</table>
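The subtraction in the table can also be sketched in a few lines of code. This is only an illustration, not part of the lesson's Excel workflow: it is written in Python, the variable names `pre`, `post`, and `differences` are chosen here for convenience, and it uses only the five subjects shown in the excerpt, so its mean will not match the mean for all 27 women.

```python
from statistics import mean

# Only the five subjects shown in the excerpt (1, 2, 3, 26, 27).
# The full Mahon.xlsx file has 27 rows, so summaries computed here
# will not match the values reported for the whole data set.
pre = [62.5, 88.8, 74.7, 76.3, 82.1]
post = [56.1, 80.2, 70.8, 73.8, 77.9]

# Difference = Post - Pre, computed pair by pair and rounded to the
# one decimal place used in the original measurements.
differences = [round(b - a, 1) for a, b in zip(pre, post)]

print(differences)        # [-6.4, -8.6, -3.9, -2.5, -4.2]
print(mean(differences))  # mean change for these five subjects only
```

Every difference is negative, which reflects the order of subtraction: Post minus Pre is negative whenever a woman weighed less at the end of the study than at the beginning.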
<!-- {| class="basic" -->
<!-- ! Subject !! Post !! Pre !! Difference -->
<!-- |- -->
<!-- | 1 || 56.1 || 62.5 || 56.1 $-$ 62.5 = -6.4 -->
<!-- |- -->
<!-- | 2 || 80.2 || 88.8 || 80.2 $-$ 88.8 = -8.6 -->
<!-- |- -->
<!-- | 3 || 70.8 || 74.7 || 70.8 $-$ 74.7 = -3.9 -->
<!-- |- -->
<!-- | $\vdots$ || $\vdots$ || $\vdots$ || $\vdots$ -->
<!-- |- -->
<!-- | 26 || 73.8 || 76.3 || 73.8 $-$ 76.3 = -2.5 -->
<!-- |- -->
<!-- | 27 || 77.9 || 82.1 || 77.9 $-$ 82.1 = -4.2 -->
<!-- |} -->
<a name="SubtractDifferences"></a>
<!-- To access this content, scroll to the bottom of the editing page and click on the link "Software:(Excel or SPSS)-(PageName)" -->
<div class="SoftwareHeading">Excel Instructions</div>
<div class="Summary">
Fortunately, the "Paired Data t-test" tab in the [Math 221 Statistics Toolbox](./Data/Math221StatisticsToolbox.xlsx) will automatically compute the differences when you paste in the data. **The Toolbox always takes the data in column A - data in column B.** Because we want to take Post - Pre, you will need to swap the order of the columns when pasting the data into the Math221 Statistics Toolbox. Follow this process:
* Copy the "Pre" column and paste it into column B, labeled "Data2" of the Toolbox.
* Copy the "Post" column and paste it into column A, labeled "Data1" of the Toolbox.
* Notice in column C the differences are automatically calculated.
* An excerpt of how the data look in the Math221 Statistics Toolbox is shown below:
<img src="./Images/Mahon-Differences_Excel_Toolbox.PNG">
</div>
<br>
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
2. Following the directions above, compute the difference in the women's weights by pasting the data in the Math221 Statistics Toolbox.
<br>
3. What is the mean of the values in the *Difference* column? (Look in cell G7)
<a href="javascript:showhide('Q3')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q3" style="display:none;">
<center>
$$
-6.80 \text{ kg}
$$
</center>
</div>
<br>
4. Interpret the value you calculated in Question 3.
<a href="javascript:showhide('Q4')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q4" style="display:none;">
* The mean weight change experienced by the women in the study was $-6.80$ kg. It can be tricky to know if this means they gained or lost weight. Remember, we calculated Post-Pre. A negative difference indicates the Pre weight was higher than the Post weight. In other words, there was a mean weight loss of $6.80$ kg.
</div>
</div>
<br>
</div>
<br>
**Relationship to a One Sample t-test**
After you have subtracted the pre-study weights from the post-study weights, you are left with a column of differences. We will denote the pre-study weights by $x_1$ and the post-study weights by $x_2$. Then, the differences can be denoted as $d = x_2 - x_1$. The difference, $d$, is defined as the change in the volunteer's weight during the study.
After computing the differences, we do not use the data for the individual groups at all. The researchers are not interested in the values of the women's weights at the beginning of the study or at the end of the study. They are mostly interested in the difference in the weights after the participants complete the study.
After we subtract, we can conduct a hypothesis test to determine if the mean of the differences is less than zero. We use the symbol $\mu_d$ to represent the true mean difference in the weights of the women who follow the diet prescribed in this study. The null hypothesis is that the true mean difference is zero ($\mu_d = 0$). The alternative hypothesis is that there is a decrease in the weights, in other words, that the true mean difference is less than zero ($\mu_d < 0$).
Notice that this is essentially a one-sample t-test where the data are the differences in the women's weights. We have one column of data, the differences. We are testing whether the true mean difference is less than zero. After subtracting, a test for a difference of two means with paired data is just like a test for one mean with $\sigma$ unknown.
In the hypothesis test, we will refer to the variable representing the differences as $d$. We will use this notation throughout the hypothesis test. For example, the true population mean will be labeled $\mu_d$ and the sample mean will be labeled $\bar d$. The sample standard deviation of the differences is denoted $s_d$.
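In symbols, the paired test statistic has the same form as the one-sample $t$ statistic, with the differences playing the role of the data. Writing $n$ for the number of pairs and taking the hypothesized mean difference to be zero, as in the null hypothesis above:

$$
t = \frac{\bar d - 0}{s_d / \sqrt{n}}, \qquad df = n - 1
$$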
**Hypothesis Test for Mahon's Weight Loss Data**
<img src="./Images/Step1.png">
**Summarize the relevant background information**
Twenty-seven women participated in a nine-week weight loss study. During the study period, the participants were provided a reduced calorie diet. Their weights were recorded at the beginning of the study and nine weeks later. The difference of the weights is defined as the post-study weights minus the pre-study weights. The researchers expected that the mean difference in the weights would be negative--in other words, that the women would tend to lose weight.
**State the null and alternative hypotheses and the level of significance**
$$
\begin{align}
H_0: &~~ \mu_d=0 \\
H_a: &~~ \mu_d < 0
\end{align}
$$
We will use the $\alpha = 0.05$ level of significance.
<img src="./Images/Step2.png">
**Describe the data collection procedures**
The women's weights were recorded at the beginning of the study. The women were provided a reduced calorie diet for nine weeks. Then, their weights were measured again at the end of the study. A calibrated scale was used to provide an accurate weight.
<img src="./Images/Step3.png">
**Give the relevant summary statistics**
Here is the Excel output:
<img src="./Images/Mahon_Paired_t-test_Excel_Toolbox.PNG">
From the Excel output illustrated above, we can see a histogram of the data and get the following numerical summaries:
$$
\begin{align}
\bar d &= -6.80 \\
s_d &= 3.17 \\
n &= 27
\end{align}
$$
The mean and standard deviation are rounded to one decimal place more than the original data.
<img src="./Images/Step4.png">
**Verify the requirements have been met**
Like the one-sample t-test, this procedure is robust, meaning that it is not very sensitive to the requirements. If they are violated, it will probably still give reasonably good results.
The requirements for this procedure are the same as the requirements for a one-sample t-test:
- the data represent a simple random sample from the population
- the mean of the differences follows a normal distribution
The subjects were recruited via advertisements for a research study. The participants volunteered to participate. It is not a simple random sample of all middle-aged women, but there is nothing about the selection of the sample that would invalidate the results.
From a practical perspective, it is impossible to get a simple random sample of people in the general population. When research trials are conducted, people must volunteer to participate. This can lead to a selection bias, but it is usually negligible.
The requirement of normality is satisfied for Mahon's data. Though the sample size (n=27) is not quite up to 30, it is still fairly large. Furthermore, the histogram of differences indicates very little skew, so $\bar d$ will be approximately normal.
<center>
<img src="./Images/Mahon-Differences-Histogram.PNG">
</center>
Even with a sample size less than 30, we can still conduct this test.
**Give the test statistic and its value**
The test statistic for a test involving paired data when $\sigma$ is unknown is a $t$. For this situation, the value is:
$$t= \frac{-6.8 - 0}{3.17/\sqrt{27}} =-11.145$$
See that this calculation matches the test statistic given in the Excel output. The degrees of freedom and p-value can also be found in the Excel output.
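The same arithmetic can be reproduced outside Excel as a quick check. The sketch below is in Python and starts from the rounded summary statistics reported above rather than the raw data, so it agrees with the Excel value only up to rounding (about $-11.146$ here versus $-11.145$ in the output). The variable names are illustrative.

```python
import math

# Summary statistics reported in the lesson for the 27 differences (Post - Pre).
d_bar = -6.80  # sample mean of the differences, in kg
s_d = 3.17     # sample standard deviation of the differences, in kg
n = 27         # number of pairs
mu_0 = 0       # value of mu_d under the null hypothesis

standard_error = s_d / math.sqrt(n)        # estimated standard deviation of d-bar
t_stat = (d_bar - mu_0) / standard_error   # one-sample t statistic on the differences
df = n - 1                                 # degrees of freedom

print(round(t_stat, 2), df)  # -11.15 26
```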
**State the degrees of freedom**
$$df = 26$$
**Mark the test statistic and $P$-value on a graph of the sampling distribution**
The test statistic and p-value can be found in the Excel output. The following calculations show conceptually how the p-value is calculated:
The test statistic, $t$, is labeled on the horizontal axis. The $P$-value is the area to the left of $t$ under the curve. This area is so small that it is hidden at the edge of the plot (it is not actually visible).
<img src="./Images/Mahon-Applet.png">
It is important to note that only the left tail is shaded, even though we cannot see it in this illustration.
**Find the $P$-value and compare it to the level of significance**
$$
P\text{-value} = 1.06 \times 10^{-11} < 0.05 = \alpha
$$
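To make the idea of "area in the left tail" concrete, the following Python sketch approximates that area numerically. It is an illustration under stated assumptions, not the method Excel uses: the helper functions `t_pdf` and `upper_tail` are hypothetical names written for this example, the area is approximated with Simpson's rule, and the symmetry of the $t$ distribution is used so that the area to the left of $-11.145$ is computed as the area to the right of $11.145$.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, width=100.0, steps=20_000):
    """Approximate P(T > t) with Simpson's rule on [t, t + width].

    steps must be even; the area beyond t + width is treated as negligible.
    """
    h = width / steps
    total = t_pdf(t, df) + t_pdf(t + width, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3

# Left-tail area for t = -11.145 with df = 26, using symmetry.
p_value = upper_tail(11.145, 26)
print(p_value)         # on the order of 1e-11, in line with the Excel output
print(p_value < 0.05)  # True
```

The result is far below $\alpha = 0.05$, which is why the shaded region cannot be seen on the graph.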
**State your decision**
Since the $P$-value is less than the level of significance, we reject the null hypothesis.
<img src="./Images/Step5.png">
**Present your conclusion in an English sentence, relating the result to the context of the problem**
There is sufficient evidence to suggest that the reduced calorie diet used in this study results in weight loss for middle-aged women.
<br>
<img src="./Images/StepsAll.png">
### Nosocomial Infections
<span id='17:IntroToNosocomialInfections'></span>
<img src="./Images/Step1.png">
**Summarize the relevant background information**
Matched-pairs designs are not just used in pre- and post-test situations. They are often used in situations where it is not possible to randomly assign subjects to groups (for example, by a coin toss.) Nosocomial (pronounced: NO-suh-KOH-MEE-uhl) infections are infections that occur in hospitals, but are not a result of the original condition. An example of a nosocomial infection is when a heart attack patient develops a staph infection at the site of an IV injection. The infection was not caused by the heart attack, but it was acquired in the hospital. Nosocomial infections are very dangerous and may result in longer recovery times or increased death rates.
<img src="./Images/Pneumonia-CDC-5803.png">
Health care providers suspect that nosocomial infections increase the amount of time required to recover from an illness or injury. In controlled experiments, subjects (e.g., patients) are randomly assigned to treatments. However, it is not ethical to give patients a nosocomial infection in order to determine if it increases the duration of their hospital stay! At best, we can collect information on the duration of hospital stays for patients who acquire nosocomial infections and compare them to the duration of the stays for patients who do not.
There are many factors that affect the amount of time that a patient will need to stay in the hospital, including: nature of illness, types of procedures conducted, overall health, gender, age, etc. How can health care practitioners assess the effect of a nosocomial infection in the presence of so many other variables?
One way is to match a patient who develops a nosocomial infection with another one who has similar characteristics (illness, procedures, health, gender, age group, etc.) but does not develop a nosocomial infection. Now, the patients are matched into pairs with similar characteristics, where the principal difference between the members of each pair is whether or not they acquired a nosocomial infection.
By pairing the patients according to specific characteristics, the researchers can now subtract to observe a difference in their recovery times. In this way, it is possible to assess if nosocomial infections increase the mean duration of a hospital stay. Some researchers conducted such a study in which 52 pairs of patients were matched based on clinical characteristics. A patient with a nosocomial infection was matched as closely as possible to a similar case where there was no nosocomial infection. Patients who died were excluded from the study <!--<cite>Vegas93</cite>-->. The lengths of the hospital stays (in days) for these patients are given in the file [NosocomialInfections.xlsx](./Data/NosocomialInfections.xlsx).
The difference, $d$, is defined as the duration of the hospital stay of the individual in the pair with the nosocomial infection minus the duration of the stay for the individual who did not get a nosocomial infection:
$$
\text{Difference} = \text{Infected} - \text{Not Infected}
$$
After computing the differences, we do not use the data for the individual groups at all. In fact, after we subtract, the hypothesis test is conducted (essentially) like a one-sample test for a single mean with $\sigma$ unknown.
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
5. **State the null and alternative hypotheses and the level of significance**
<a href="javascript:showhide('Q5')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q5" style="display:none;">
<center>
$$
\begin{align}
H_0: &~~ \mu_d = 0 \\
H_a: &~~ \mu_d > 0 \\
\end{align}
$$
</center>
* The level of significance was not specified in the problem. You can choose any value you wish. The most common choices are 0.05, 0.01 and 0.1. We will illustrate this example with $\alpha = 0.05$.
</div>
</div>
<br>
In order to get the correct $P$-value, we need to indicate the proper alternative hypothesis in Excel. In cell N6 be sure the "Greater Than" symbol is selected in the drop-down menu.
<img src="./Images/TypeOfTest-GreaterThan-Excel.png">
<br>
<img src="./Images/Step2.png">
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
6. **Describe the data collection procedures**
<a href="javascript:showhide('Q6')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q6" style="display:none;">
* Data were collected by matching hospital records of individuals who were admitted to the hospital. Patient records were matched based on their overall health and the reason they were admitted to the hospital. In each pair, one patient developed a nosocomial infection and one did not. Since the characteristics of the patients in the first group determined which patients would be paired with them in the second group, the data represent dependent samples.
</div>
</div>
<br>
<img src="./Images/Step3.png">
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
7. **Give the relevant summary statistics**
<a href="javascript:showhide('Q7')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q7" style="display:none;">
<center>
$$
\begin{align}
\bar d &= 11.38 \\
s_d &= 13.83 \\
n &= 52
\end{align}
$$
</center>
</div>
<br>
8. **Make an appropriate graph to illustrate the data**
<a href="javascript:showhide('Q8')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q8" style="display:none;">
- Present a graph showing the differences.
<center>
<img src="./Images/Nosocomial-Differences-Histogram.png">
</center>
</div>
</div>
<br>
<img src="./Images/Step4.png">
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
9. **Verify the requirements have been met**
<a href="javascript:showhide('Q9')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q9" style="display:none;">
* The data represent a random sample of patients, who have been matched based on their overall health and their current ailment. The sample size is large, so the mean of the differences $\bar d$ will be approximately normally distributed.
</div>
<br>
10. **Give the test statistic and its value**
<a href="javascript:showhide('Q10')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q10" style="display:none;">
* The test statistic for a test for two means with paired data is a $t$.
$$t = 5.935$$
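
As a rough check, substituting the rounded summary statistics from Question 7 into the one-sample formula gives approximately the same value. The small discrepancy comes from rounding $\bar d$ and $s_d$ before dividing:

$$
t = \frac{11.38 - 0}{13.83/\sqrt{52}} \approx 5.93
$$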
</div>
<br>
11. **State the degrees of freedom**
<a href="javascript:showhide('Q11')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q11" style="display:none;">
<center>
$$
df = 51
$$
</center>
</div>
<br>
12. **Mark the test statistic and $P$-value on a graph of the sampling distribution**
<a href="javascript:showhide('Q12')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q12" style="display:none;">
* The test statistic and p-value can be found in the Excel output and you should not need to calculate it by hand. Your sketch should show the value of $t=5.935$ on the horizontal axis, with only the tiny area to the right of 5.935 shaded.
</div>
<br>
13. **Find the $P$-value and compare it to the level of significance**
<a href="javascript:showhide('Q13')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q13" style="display:none;">
The p-value can be found in the Excel output and you should not need to calculate it by hand. The calculation below shows how the p-value in Excel is being calculated:
<center>
$$
P\textrm{-value}=\frac{\textrm{Sig. (2-tailed)}}{2}=\frac{2.592\times 10^{-7}}{2}=1.296 \times 10^{-7} = 0.0000001296 < 0.05 = \alpha
$$
</center>
</div>
<br>
14. **State your decision**
<a href="javascript:showhide('Q14')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q14" style="display:none;">
* Since the $P$-value is less than the level of significance, we reject the null hypothesis.
</div>
</div>
<br>
<img src="./Images/Step5.png">
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
15. **Present your conclusion in an English sentence, relating the result to the context of the problem**
<a href="javascript:showhide('Q15')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q15" style="display:none;">
* There is sufficient evidence to suggest that the mean duration of hospital stays is increased when a patient develops a nosocomial infection.
</div>
</div>
<br>
### Additional Worked Examples
Viewing additional examples can help your understanding. Click on the link below to see two more examples of hypothesis tests.
<a href="javascript:showhide('ae')"><span style="font-size:8pt;">Show/Hide Additional Examples</span></a>
<div id="ae" style="display:none;">
<img src="./Images/StepsAll.png">
#### Effect of Stressful Classical Music on Your Metabolism
<img src="./Images/Step1.png">
**Summarize the relevant background information**
Obesity is a growing problem worldwide. Many scientists are seeking creative solutions to trim down this epidemic. Reduced energy expenditure is a potential cause of obesity.
Resting Energy Expenditure (REE) is defined as the amount of energy a person would use if resting for 24 hours. In essence, this is the amount of energy that a person's body will consume if they do not do any physical activity. REE is measured in kilojoules per day (kJ/d).
REE accounts for approximately 70 to 80% of all energy that a person will expend in a day. <!--<cite>Carlsson05</cite>--> If researchers can find simple, enjoyable activities that will increase REE, it may be possible to minimize the spread of obesity around the world.
Ebba Carlsson and other researchers in Sweden investigated whether listening to stressful classical music increases a person's REE. <!--<cite>Carlsson05</cite>--> Each subject's REE was measured during silence and again while listening to stressful classical music. Data representing their results are given in the file [REE-ClassicalMusic](./Data/REE-ClassicalMusic.xlsx).
Notice that this is not a pre- and post-test, but it is still a test involving paired data. Two REE measurements were made for each subject: (1) in silence ($REE_1$) and (2) while listening to stressful classical music ($REE_2$).
**State the null and alternative hypotheses and the level of significance**
Since we are testing for an increase in the mean REE, we let $d = REE_2 - REE_1$. Our alternative hypothesis will be that $\mu_d > 0$. The null and alternative hypotheses are:
$$
\begin{align}
H_0: &~~ \mu_d = 0 \\
H_a: &~~ \mu_d > 0
\end{align}
$$
Note that the data set lists $REE_1$ in the first column and $REE_2$ in the second column. Because $d = REE_2 - REE_1$, you will need to switch the order of the columns when pasting them into the Excel Toolbox.
We will use the $\alpha = 0.1$ level of significance.
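
The definition $d = REE_2 - REE_1$ can be sketched in a few lines of Python. This is only an illustration: the REE values below are hypothetical and are not taken from the study's data file, and the variable names are invented for this example.

```python
from statistics import mean, stdev

# Hypothetical REE values in kJ/d, for illustration only (NOT the study data).
ree_silence = [6200, 5900, 7100, 6500]  # REE_1: first column in the data file
ree_music = [6260, 5870, 7180, 6490]    # REE_2: second column in the data file

# d = REE_2 - REE_1, so a positive d means REE was higher with music.
differences = [music - silence for silence, music in zip(ree_silence, ree_music)]

n = len(differences)
d_bar = mean(differences)   # mean of the differences
s_d = stdev(differences)    # sample standard deviation (divides by n - 1)

print(differences, n, d_bar, round(s_d, 2))
```

Subtracting in this order keeps the sign of each difference aligned with the alternative hypothesis $\mu_d > 0$.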
<img src="./Images/Step2.png">
**Describe the data collection procedures**
The REE was measured by a technique called "indirect calorimetry" using a Deltatrac II Metabolic Monitor. <!--<cite>Carlsson05</cite>--> The REE was measured twice for each person: while the person was (1) resting in silence or (2) resting while listening to stressful classical music. These trials were conducted in random order. Some of the subjects had the "silence" treatment first, and others had the "stressful" treatment first.
<img src="./Images/Step3.png">
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
16. We will define the difference in REE by subtracting the REE in silence from the REE while listening to stressful classical music. If listening to stressful classical music actually increases the mean REE, would you expect the value of the difference to be typically positive or negative?
<a href="javascript:showhide('Q16')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q16" style="display:none;">
* If the REE is higher while listening to stressful classical music than while resting in silence, we would expect the value of the difference to be positive. In other words, the following difference would tend to be positive:
$$
Difference = Stressful - Silence
$$
</div>
<br>
17. Compute the difference in REE for each person. What is the value of the difference for the first person listed in the data file?
<a href="javascript:showhide('Q17')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q17" style="display:none;">
* 50 kJ/d
* Here is an illustration of an excerpt of the data in Excel:
<center>
<img src="./Images/REE-Data-Excel.png">
</center>
</div>
<br>
**Give the relevant summary statistics**
18. Report the number of subjects ($n$), the mean difference ($\bar d$), and the standard deviation of the differences ($s_d$).
<a href="javascript:showhide('Q18')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q18" style="display:none;">
The following image illustrates the Excel file used to get the summary statistics.
<center>
<img src="./Images/REE-Output-Excel_Toolbox.png">
</center>
$$
\begin{align}
n&=40\\
\bar d &= 20~\text{kJ/d}\\
s_d &= 160~\text{kJ/d}
\end{align}
$$
</div>
<br>
19. **See above output for a graph of the data**
</div>
<br>
<img src="./Images/Step4.png">
**Verify the requirements have been met**
We can consider the sample representative of the population. Because the number of differences ($n=40$) is greater than 30, the Central Limit Theorem tells us that the sampling distribution of $\bar d$ will be approximately normal.
The requirements for this test appear to have been satisfied.
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
20. **Give the test statistic and its value**
<a href="javascript:showhide('Q20')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q20" style="display:none;">
* The test statistic for a test for two means with paired data is a $t$.
$$t=0.793$$
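
For reference, the paired $t$ statistic is the standard one-sample $t$ statistic applied to the differences, with a hypothesized mean difference of 0 under $H_0$:

$$
t = \frac{\bar d - 0}{s_d / \sqrt{n}}, \qquad df = n - 1
$$

Plugging in the rounded summary statistics reported above gives $20 / (160/\sqrt{40}) \approx 0.79$. The small gap between this and $0.793$ is presumably due to rounding in the reported values of $\bar d$ and $s_d$; the software works from the unrounded differences.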
</div>
<br>
21. **State the degrees of freedom**
<a href="javascript:showhide('Q21')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q21" style="display:none;">
<center>
$$
df = 39
$$
</center>
</div>
22. **Mark the test statistic and $P$-value on a graph of the sampling distribution**
<a href="javascript:showhide('Q22')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q22" style="display:none;">
* The test statistic is plotted on the horizontal axis. The $P$-value is shaded in green. The same value can be found on the Excel output in cell O10.
<center>
<img src="./Images/REE-Applet.png">
</center>
</div>
<br>
23. **Find the $P$-value and compare it to the level of significance**
<a href="javascript:showhide('Q23')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q23" style="display:none;">
<center>
<!--$$ P\textrm{-value}=\frac{0.433}{2}=0.216$$-->
$P\textrm{-value}=0.2163 > 0.1 = \alpha$
</center>
* Notice that the $P$-value is half as large for a one-tailed test as it would have been for a two-tailed test. Since we have a one-sided alternative hypothesis, we are only interested in the right tail of the $t$-distribution.
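
The relationship between the one-tailed and two-tailed $P$-values can be checked numerically. The sketch below is not how the Excel Toolbox computes the value; it is a minimal illustration that integrates the $t$ density with Simpson's rule, using only Python's standard library. The function names `t_pdf` and `upper_tail` are invented for this example.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half of the area lies to the right of 0; subtract the area between 0 and t.
    return 0.5 - total * h / 3

t_stat, df, alpha = 0.793, 39, 0.1

one_tailed = upper_tail(t_stat, df)  # right tail only, matching H_a: mu_d > 0
two_tailed = 2 * one_tailed          # both tails, for comparison

print(round(one_tailed, 4), round(two_tailed, 4), one_tailed > alpha)
```

The one-tailed area is compared to $\alpha$ here because the alternative hypothesis is one-sided.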
</div>
<br>
24. **State your decision**
<a href="javascript:showhide('Q24')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q24" style="display:none;">
* Since the $P$-value is greater than the level of significance, we fail to reject the null hypothesis.
</div>
</div>
<br>
<img src="./Images/Step5.png">
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
25. **Present your conclusion in an English sentence, relating the result to the context of the problem**
<a href="javascript:showhide('Q25')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q25" style="display:none;">
- There is insufficient evidence to suggest that the mean REE is *increased* by listening to stressful classical music. Lying still and listening to stressful classical music is probably not the best way to increase your metabolism!
</div>
</div>
<br>
Note that we did not say we "accept" the null hypothesis. We do not know that listening to stressful classical music has no effect on a person's REE. Based on the data available to us, we were not able to reject the assertion that this type of music does not increase the mean REE.
<br>
<img src="./Images/StepsAll.png">
#### Cost of Airline Tickets
<img src="./Images/Step1.png">
**Summarize the relevant background information**
Pressures of supply and demand act directly on the price of an airline ticket. As the seats available on the plane begin to fill, airlines raise the price. If seats on a flight do not sell well, an airline may discount the tickets or even cancel the flight. Business travelers frequently book travel on short notice and must pay the current price. Tourists typically book their flights well in advance, hoping to buy tickets before the price rises. We will consider the cost of a one-way ticket from London's Heathrow Airport to a variety of destinations in Europe.
Allie Henrich, a BYU-Idaho student, compared the lowest published ticket prices of one-way flights from Heathrow to various destinations in Europe. Using Travelocity.com, she recorded the lowest published fares for nonstop midweek flights booked either 14 days in advance or 90 days in advance. The prices (in US dollars) are given in the file [DirectFlightCosts.xlsx](./Data/DirectFlightCosts.xlsx). Notice that for some destinations, flights were not available.
The data are paired, because we are measuring the costs twice for each city. The 14-day ticket price is paired with the 90-day price for each city.
We will conduct a hypothesis test to determine if there is a difference in the cost of the nonstop flights when tickets are purchased 14 days in advance compared to 90 days in advance. We will use the 0.01 level of significance.
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
26. **State the null and alternative hypotheses and the level of significance**
<a href="javascript:showhide('Q26')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q26" style="display:none;">
<center>
$$
\begin{array}{lcl}
H_0:\mu_d = 0 \\
H_a:\mu_d \ne 0 \\
\alpha = 0.01
\end{array}
$$
</center>
</div>
</div>
<br>
<img src="./Images/Step2.png">
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
27. **Describe the data collection procedures**
<a href="javascript:showhide('Q27')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q27" style="display:none;">
* The data were collected using the website Travelocity.com. The lowest advertised ticket prices were recorded for nonstop flights from Heathrow Airport. All prices were recorded in US dollars. Data are provided on the cost of a nonstop ticket purchased with 14 days' notice compared to 90 days' notice.
* We will compute the difference in the costs for each destination. Some destinations did not include both flight options. In this case, the difference is not computed and the data are omitted from the analysis.
</div>
</div>
<br>
<img src="./Images/Step3.png">
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
28. **Give the relevant summary statistics**
<a href="javascript:showhide('Q28')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q28" style="display:none;">
* The differences were computed by subtracting the 90-day price from the 14-day price. For example, for the Adnan Menderes Airport, we have
$$202.09 - 234.19 = -32.10$$
* You may have chosen to subtract in the opposite order. If so, you would have obtained a value of $32.10$ dollars.
$$
\begin{align}
n&=87\\
\bar d &= 24.612\\
s_d &= 136.267
\end{align}
$$
<br>
<div class="message Note">If you defined your difference as the 90-day price minus the 14-day price, then you would have observed a value of $\bar d = -24.612$ dollars for the mean of the differences. You were not instructed on the order in which to subtract, so this is a correct response. The value for the standard deviation of the difference and the number of observations (pairs) will be the same as is given above.</div>
<br>
<br>
</div>
<br>
29. **Make an appropriate graph to illustrate the data**
<a href="javascript:showhide('Q29')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q29" style="display:none;">
<center>
<img src="./Images/DirectFlightCosts-Histogram1-Excel.png">
</center>
* If you defined your difference as the 90-day price minus the 14-day price, then you would have the following histogram:
<center>
<img src="./Images/DirectFlightCosts-Histogram2-Excel.png">
</center>
</div>
</div>
<br>
<img src="./Images/Step4.png">
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
30. **Verify the requirements have been met**
<a href="javascript:showhide('Q30')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q30" style="display:none;">
<!-- This is not a simple random sample of airports. Rather, the sample was chosen from the list of the busiest airports in Europe. However, we are not making an inference on the airports but on the difference in the cost of the flights. -->
- The sample size is large ($n = 87$), so we can conclude that the sample mean, $\bar d$, is approximately normally distributed.
</div>
<br>
31. **Give the test statistic and its value**
<a href="javascript:showhide('Q31')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q31" style="display:none;">
* The test statistic for a test for two means with paired data is a $t$.
$$t=1.685$$
* If you computed the difference as the 90-day price minus the 14-day price, the value of your test statistic is $-1.685$.
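
As a check, the value of the test statistic can be reproduced from the summary statistics given earlier ($n = 87$, $\bar d = 24.612$, $s_d = 136.267$):

$$
t = \frac{\bar d - 0}{s_d / \sqrt{n}} = \frac{24.612}{136.267 / \sqrt{87}} \approx \frac{24.612}{14.609} \approx 1.685
$$

Reversing the order of subtraction changes the sign of $\bar d$ but not $s_d$ or $n$, which is why the test statistic simply changes sign.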
</div>
<br>
32. **State the degrees of freedom**
<a href="javascript:showhide('Q32')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q32" style="display:none;">
<center>
$$
df = 86
$$
</center>
</div>
<br>
33. **Mark the test statistic and $P$-value on a graph of the sampling distribution**
<a href="javascript:showhide('Q33')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q33" style="display:none;">
- The test statistic is plotted on the horizontal axis. The $P$-value is shaded in green:
<center>
<img src="./Images/DirectFlightCosts-Applet.png">
</center>
</div>
<br>
34. **Find the $P$-value and compare it to the level of significance**
<a href="javascript:showhide('Q34')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q34" style="display:none;">
<center>
$$
P\textrm{-value}= 0.096 > 0.01 = \alpha
$$
</center>
The $P$-value will be 0.096, regardless of the order in which you subtracted the values.
</div>
<br>
35. **State your decision**
<a href="javascript:showhide('Q35')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q35" style="display:none;">
Since the $P$-value is greater than the level of significance, we fail to reject the null hypothesis.
</div>
</div>
<br>
<img src="./Images/Step5.png">
<div class="QuestionsHeading">Answer the following questions:</div>
<div class="Questions">
36. **Present your conclusion in an English sentence, relating the result to the context of the problem**
<a href="javascript:showhide('Q36')"><span style="font-size:8pt;">Show/Hide Solution</span></a>
<div id="Q36" style="display:none;">
* There is insufficient evidence to suggest that there is a difference in the mean cost of airline tickets purchased 14 days versus 90 days in advance.
</div>
</div>
<br>
## Confidence Intervals
We can compute a confidence interval for the true mean of the differences for paired data. After the differences between the paired observations have been calculated, we follow the instructions for creating a confidence interval for one mean with $\sigma$ unknown, using the column of differences as the data set.
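
In symbols, the interval is $\bar d \pm t^* \cdot s_d/\sqrt{n}$, where $t^*$ is the critical value from the $t$-distribution with $n - 1$ degrees of freedom. The sketch below illustrates the calculation using the summary statistics from the airline ticket example. It is not the Excel Toolbox's method: the 99% confidence level is an assumption chosen to mirror $\alpha = 0.01$, the function names are invented for this example, and $t^*$ is found by numerically integrating the $t$ density and bisecting.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(conf_level, df):
    """Find t* such that the central area under the t curve equals conf_level."""
    target = 1 - (1 - conf_level) / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):  # bisection on the CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def paired_ci(n, d_bar, s_d, conf_level):
    """Confidence interval for the true mean of the differences."""
    t_star = t_critical(conf_level, n - 1)
    margin = t_star * s_d / math.sqrt(n)
    return d_bar - margin, d_bar + margin

# Summary statistics from the airline ticket example; 99% level is an assumption.
lower, upper = paired_ci(n=87, d_bar=24.612, s_d=136.267, conf_level=0.99)
print(round(lower, 2), round(upper, 2))
```

An interval that contains 0 is consistent with failing to reject $H_0: \mu_d = 0$ in a two-sided test at the matching level of significance.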
<div class="SoftwareHeading">Excel Instructions</div>
<div class="Summary">
**To calculate confidence intervals for the true mean of the difference in Excel, do the following**:
* Open the file [Math 221 Statistics Toolbox](./Data/Math221StatisticsToolbox.xlsx)
* Click on the tab labeled "Paired Data t-test"
* Enter the columns of paired data into columns A and B
* Set the desired confidence level.
<br>
</div>
<br>
The requirements for creating a confidence interval for the mean of the differences are the same as the requirements for the hypothesis test. We assume:
* A simple random sample was drawn from the population
* The sample mean of the differences, $\bar d$, is normally distributed
<img src="./Images/StepsAll.png">
<img src="./Images/PineBeetleDamage-1441150-LG.png">
### Mountain Pine Beetle Attacks
<img src="./Images/Step1.png">
**Summarize the relevant background information**