# Probability
```{r, echo=FALSE}
# Detach any packages that happen to already be loaded. In general this is unnecessary
# but is important for the creation of the book to not have package namespaces
# fighting unexpectedly.
pkgs = names(sessionInfo()$otherPkgs)
if( length(pkgs) > 0 ){
pkgs = paste('package:', pkgs, sep = "")
for( i in 1:length(pkgs)){
detach(pkgs[i], character.only = TRUE, force=TRUE)
}
}
# Set my default chunk options
knitr::opts_chunk$set( fig.height=3, cache=TRUE )
```
```{r, message=FALSE, warning=FALSE}
# Every chapter, we will load all the libraries we will use at the beginning
# of the chapter.
library(ggplot2) # graphing functions
library(dplyr) # data summary tools
# Set default behavior of ggplot2 graphs to be black/white theme
theme_set(theme_bw())
```
We need to work out the mathematics of what we mean by probability. To begin, we first define an outcome: an outcome is one observation from a random process or event. For example we might be interested in a single roll of a six-sided die. Alternatively we might be interested in selecting one NAU student at random from the entire population of NAU students.
## Introduction to Set Theory
Before we jump into probability, it is useful to review a little bit of set theory.
Events are properties of a particular outcome. For a coin flip, the event “Heads” would be the event that a heads was flipped. For the single roll of a six-sided die, a possible event might be that the result is even. For the NAU student, we might be interested in the event that the student is a biology student. A second event of interest might be if the student is an undergraduate.
### Venn Diagrams
Let $S$ be the set of all outcomes of my random trial. Suppose I am interested in two events $A$ and $B$. The traditional way of representing these events is using a Venn diagram.
```{r, echo=FALSE}
df.A <- data.frame(theta=seq(0,2*pi, length=1000)) %>%
mutate(x = cos(theta),
y = sin(theta),
grp = 'A') %>%
select(-theta)
df.B <- df.A %>% mutate( x = x+1, grp='B')
df.box <- data.frame( x=c(-1.2, 2.2, 2.2, -1.2, -1.2),
y=c(-1.2, -1.2 ,1.2, 1.2, -1.2),
grp='Box' )
df <- rbind(df.A, df.B, df.box)
df.label <- NULL %>%
rbind( data.frame(x= -.75, y= .85, label='A'),
data.frame(x= 1.75, y= -.85, label='B'),
data.frame(x= 2.1, y= 1.1, label='S'))
ggplot(df, aes(x=x, y=y)) +
geom_path( aes(group=grp)) +
scale_x_continuous(breaks=NULL) +
scale_y_continuous(breaks=NULL) +
geom_text( data=df.label, aes(label=label)) +
labs(x='', y='') + theme_minimal()
```
For example, suppose that my random experiment is rolling a fair 6-sided die once. The possible outcomes are $S=\{1,2,3,4,5,6\}$. Suppose I then define events $A=$ roll is odd and $B=$ roll is 5 or greater. In this case our picture is:
```{r, echo=FALSE}
df.A <- data.frame(theta=seq(0,2*pi, length=1000)) %>%
mutate(x = cos(theta),
y = sin(theta),
grp = 'A') %>%
select(-theta)
df.B <- df.A %>% mutate( x = x+1, grp='B')
df.box <- data.frame( x=c(-1.2, 2.2, 2.2, -1.2, -1.2),
y=c(-1.2, -1.2 ,1.2, 1.2, -1.2),
grp='Box' )
df <- rbind(df.A, df.B, df.box)
df.label <- NULL %>%
rbind( data.frame(x= -.75, y= .85, label='A'),
data.frame(x= 1.75, y= -.85, label='B'),
data.frame(x= 2.1, y= 1.1, label='S'))
df.nums <- NULL %>%
rbind( data.frame(x= -.7, y=0, label='1,3'),
data.frame(x= .5, y=0, label='5'),
data.frame(x= 1.5, y=0, label='6'),
data.frame(x= .5, y=-1, label='2,4'))
ggplot(df, aes(x=x, y=y)) +
geom_path( aes(group=grp)) +
scale_x_continuous(breaks=NULL) +
scale_y_continuous(breaks=NULL) +
geom_text( data=df.label, aes(label=label)) +
geom_text( data=df.nums, aes(label=label)) +
labs(x='', y='') + theme_minimal()
```
All of the possible outcomes are present and are distributed among the events of interest.
### Composition of events
We are often interested in discussing the composition of two events, and we give the common set operations below.
* Union: Denote the event that either $A$ or $B$ occurs as $A\cup B$.
```{r, echo=FALSE}
df.A <- data.frame(theta=seq(0,2*pi, length=1000)) %>%
mutate(x = cos(theta),
y = sin(theta),
grp = 'A') %>%
select(-theta)
df.B <- df.A %>% mutate( x = x+1, grp='B')
df.box <- data.frame( x=c(-1.2, 2.2, 2.2, -1.2, -1.2),
y=c(-1.2, -1.2 ,1.2, 1.2, -1.2),
grp='Box' )
df <- rbind(df.A, df.B, df.box)
df.label <- NULL %>%
rbind( data.frame(x= -.75, y= .85, label='A'),
data.frame(x= 1.75, y= -.85, label='B'),
data.frame(x= 2.1, y= 1.1, label='S'))
ggplot(df, aes(x=x, y=y)) +
geom_path( aes(group=grp)) +
geom_polygon(data=df%>%filter(grp!='Box'),
aes(y=y, group=grp), alpha=.4, fill='grey') +
scale_x_continuous(breaks=NULL) +
scale_y_continuous(breaks=NULL) +
geom_text( data=df.label, aes(label=label)) +
labs(x='', y='') +
ggtitle( expression(A*union(B))) +
theme_minimal()
```
* Intersection: Denote the event that **both** $A$ and $B$ occur as $A\cap B$
```{r, echo=FALSE}
df.A <- data.frame(theta=seq(0,2*pi, length=1000)) %>%
mutate(x = cos(theta),
y = sin(theta),
grp = 'A') %>%
select(-theta)
df.B <- df.A %>% mutate( x = x+1, grp='B')
df.Ac <- rbind(
df.A %>% filter(x <= .5),
df.B %>% filter(x <= .5))
df.AnB <- rbind(
df.A %>% filter(x >= .5),
df.B %>% filter(x <= .5))
df.Bc <- rbind(
df.A %>% filter(x >= .5),
df.B %>% filter(x >= .5))
df.box <- data.frame( x=c(-1.2, 2.2, 2.2, -1.2, -1.2),
y=c(-1.2, -1.2 ,1.2, 1.2, -1.2),
grp='Box' )
df <- rbind(df.A, df.B, df.box)
df.label <- NULL %>%
rbind( data.frame(x= -.75, y= .85, label='A'),
data.frame(x= 1.75, y= -.85, label='B'),
data.frame(x= 2.1, y= 1.1, label='S'))
ggplot(df, aes(x=x, y=y)) +
geom_path( aes(group=grp)) +
geom_polygon(data=df.AnB,
aes(y=y), alpha=.4, fill='grey') +
scale_x_continuous(breaks=NULL) +
scale_y_continuous(breaks=NULL) +
geom_text( data=df.label, aes(label=label)) +
labs(x='', y='') +
ggtitle( expression(A*intersect(B))) +
theme_minimal()
```
* Complement: Denote the event that $A$ does not occur as $\bar{A}$ or $A^{C}$ (different people use different notations)
```{r, echo=FALSE}
df.A <- data.frame(theta=seq(0,2*pi, length=1000)) %>%
mutate(x = cos(theta),
y = sin(theta),
grp = 'A') %>%
select(-theta)
df.B <- df.A %>% mutate( x = x+1, grp='B')
df.Ac <- rbind(
df.A %>% filter(x <= .5),
df.B %>% filter(x <= .5))
df.AnB <- rbind(
df.A %>% filter(x >= .5),
df.B %>% filter(x <= .5))
df.Bc <- rbind(
df.A %>% filter(x >= .5),
df.B %>% filter(x >= .5))
df.box <- data.frame( x=c(-1.2, 2.2, 2.2, -1.2, -1.2),
y=c(-1.2, -1.2 ,1.2, 1.2, -1.2),
grp='Box' )
df <- rbind(df.A, df.B, df.box)
df.label <- NULL %>%
rbind( data.frame(x= -.75, y= .85, label='A'),
data.frame(x= 1.75, y= -.85, label='B'),
data.frame(x= 2.1, y= 1.1, label='S'))
ggplot(df, aes(x=x, y=y)) +
geom_polygon(data=df.box,
aes(y=y), alpha=.4, fill='grey') +
geom_polygon( data=df.A, fill='white') +
# geom_polygon( data=df.B %>% filter(x <=.5), fill='grey', alpha=.4) +
geom_path( aes(group=grp)) +
scale_x_continuous(breaks=NULL) +
scale_y_continuous(breaks=NULL) +
geom_text( data=df.label, aes(label=label)) +
labs(x='', y='') +
ggtitle( expression( paste( bar(A), ' or ', A^c )) ) +
theme_minimal()
```
**Definition 1**. Two events $A$ and $B$ are said to be mutually exclusive (or disjoint) if the occurrence of one event precludes the occurrence of the other. For example, on a single roll of a die, a two and a five cannot both come up. For a second example, define $A$ to be the event that the die is even, and $B$ to be the event that the die comes up as a $5$.
```{r, echo=FALSE}
df.A <- data.frame(theta=seq(0,2*pi, length=1000)) %>%
mutate(x = cos(theta),
y = sin(theta),
grp = 'A') %>%
select(-theta)
df.B <- df.A %>% mutate( x = x+2.5, grp='B')
df.box <- data.frame( x=c(-1.2, 3.8, 3.8, -1.2, -1.2),
y=c(-1.2, -1.2 ,1.2, 1.2, -1.2),
grp='Box' )
df <- rbind(df.A, df.B, df.box)
df.label <- NULL %>%
rbind( data.frame(x= -.75, y= .85, label='A'),
data.frame(x= 1.75, y= -.85, label='B'),
data.frame(x= 3.65, y= 1.1, label='S'))
ggplot(df, aes(x=x, y=y)) +
geom_path( aes(group=grp)) +
scale_x_continuous(breaks=NULL) +
scale_y_continuous(breaks=NULL) +
geom_text( data=df.label, aes(label=label)) +
labs(x='', y='') + theme_minimal()
```
## Probability Rules
### Simple Rules
We now take our Venn diagrams and use them to understand the rules of probability. The underlying idea that we will use is that the probability of an event is the area it covers in the Venn diagram.
**Definition 2**. Probability is the proportion of times an event occurs in many repeated trials of a random phenomenon. In other words, probability is the long-term relative frequency.
**Fact**. *For any event $A$ the probability of the event $P(A)$ satisfies $0\le P(A) \le 1$ because proportions always lie in $[0,1]$.*
Because $S$ is the set of all outcomes that might occur, the area of our bounding rectangle will be $1$ and the probability of event $A$ occurring will be represented by the area of circle $A$.
**Fact**. *If two events are mutually exclusive, then $P(A\cup B)=P(A)+P(B)$*
```{r, echo=FALSE}
df.A <- data.frame(theta=seq(0,2*pi, length=1000)) %>%
mutate(x = cos(theta),
y = sin(theta),
grp = 'A') %>%
select(-theta)
df.B <- df.A %>% mutate( x = x+2.5, grp='B')
df.box <- data.frame( x=c(-1.2, 3.8, 3.8, -1.2, -1.2),
y=c(-1.2, -1.2 ,1.2, 1.2, -1.2),
grp='Box' )
df <- rbind(df.A, df.B, df.box)
df.label <- NULL %>%
rbind( data.frame(x= -.75, y= .85, label='A'),
data.frame(x= 1.75, y= -.85, label='B'),
data.frame(x= 3.65, y= 1.1, label='S'))
ggplot(df, aes(x=x, y=y)) +
geom_polygon(data=df.A, fill='grey', alpha=.7) +
geom_polygon(data=df.B, fill='grey', alpha=.7) +
geom_path( aes(group=grp)) +
scale_x_continuous(breaks=NULL) +
scale_y_continuous(breaks=NULL) +
geom_text( data=df.label, aes(label=label)) +
labs(x='', y='') +
ggtitle(expression( P( A*union(B) ) == P(A) + P(B) )) +
theme_minimal()
```
**Example**. Let $R$ be the sum of two different colored dice. Suppose we are interested in $P(R \le 4)$. Notice that the pair of dice can fall 36 different ways (6 ways for the first die and 6 for the second gives $6\times6=36$ possible outcomes), and each way has equal probability $1/36$. Because the dice cannot simultaneously sum to $2$ and to $3$, we could write
$$\begin{aligned} P(R \le 4 )
&= P(R=2)+P(R=3)+P(R=4) \\
&= P(\left\{ 1,1\right\} )+P(\left\{ 1,2\right\} \mathrm{\textrm{ or }}\left\{ 2,1\right\} )+P(\{1,3\}\textrm{ or }\{2,2\}\textrm{ or }\{3,1\}) \\
&= \frac{1}{36}+\frac{2}{36}+\frac{3}{36} \\
&= \frac{6}{36} \\
&= \frac{1}{6} \end{aligned}$$
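Because the 36 outcomes are equally likely, we can also check this arithmetic by brute-force enumeration in R. The sketch below is ours (the data frame name `rolls` does not appear elsewhere in the chapter):

```{r}
# Enumerate all 36 equally likely rolls of two dice and verify P(R <= 4)
rolls <- expand.grid(die1 = 1:6, die2 = 1:6) %>%
  mutate(R = die1 + die2)
mean( rolls$R <= 4 )   # proportion of outcomes with a sum of 4 or less: 6/36 = 1/6
```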
**Fact**. $P(A)+P(\bar{A})=1$
```{r, echo=FALSE}
df.A <- data.frame(theta=seq(0,2*pi, length=1000)) %>%
mutate(x = cos(theta),
y = sin(theta),
grp = 'A') %>%
select(-theta)
df.B <- df.A %>% mutate( x = x+2.5, grp='B')
df.box <- data.frame( x=c(-1.2, 3.8, 3.8, -1.2, -1.2),
y=c(-1.2, -1.2 ,1.2, 1.2, -1.2),
grp='Box' )
df <- rbind(df.A, df.box)
df.label <- NULL %>%
rbind( data.frame(x= -.75, y= .85, label='A'),
data.frame(x= 1.75, y= 0, label='A^c'),
data.frame(x= 3.65, y= 1.1, label='S'))
ggplot(df, aes(x=x, y=y)) +
geom_polygon(data=df.box, fill='pink', alpha=.7) +
geom_polygon(data=df.A, fill='grey', alpha=.7) +
geom_path( aes(group=grp)) +
scale_x_continuous(breaks=NULL) +
scale_y_continuous(breaks=NULL) +
geom_text( data=df.label, aes(label=label), parse=TRUE) +
labs(x='', y='') +
ggtitle(expression( P( A ) + P(A^c) == 1 )) +
theme_minimal()
```
The above statement is true because the probability of the whole space $S$ is one (remember $S$ is the set of all possible outcomes), and any outcome is either one in which $A$ occurs or one in which $A$ does not occur.
**Fact**. $P(A\cup B)=P(A)+P(B)-P(A\cap B)$
The reason behind this fact is that if $A$ and $B$ are not disjoint, then some area is added twice when I calculate $P\left(A\right)+P\left(B\right)$. To account for this, I simply subtract off the area that was double counted.
```{r, echo=FALSE}
df.A <- data.frame(theta=seq(0,2*pi, length=1000)) %>%
mutate(x = cos(theta),
y = sin(theta),
grp = 'A') %>%
select(-theta)
df.B <- df.A %>% mutate( x = x+1, grp='B')
df.Ac <- rbind(
df.A %>% filter(x <= .5),
df.B %>% filter(x <= .5))
df.AnB <- rbind(
df.A %>% filter(x >= .5),
df.B %>% filter(x <= .5))
df.Bc <- rbind(
df.A %>% filter(x >= .5),
df.B %>% filter(x >= .5))
df.box <- data.frame( x=c(-1.2, 2.2, 2.2, -1.2, -1.2),
y=c(-1.2, -1.2 ,1.2, 1.2, -1.2),
grp='Box' )
df <- rbind(df.A, df.B, df.box)
df.label <- NULL %>%
rbind( data.frame(x= -.75, y= .85, label='A'),
data.frame(x= 1.75, y= -.85, label='B'),
data.frame(x= 2.1, y= 1.1, label='S'))
ggplot(df, aes(x=x, y=y)) +
geom_path( aes(group=grp)) +
geom_polygon(data=df.A, alpha=.4, fill='red') +
geom_polygon(data=df.B, alpha=.4, fill='blue') +
scale_x_continuous(breaks=NULL) +
scale_y_continuous(breaks=NULL) +
geom_text( data=df.label, aes(label=label)) +
labs(x='', y='') +
ggtitle(expression( P(A*union(B)) == P(A) + P(B) - P(A*intersect(B)) )) +
theme_minimal()
```
**Fact 3**. $P(A)=P(A\cap B)+P(A\cap\bar{B})$
```{r, echo=FALSE}
df.A <- data.frame(theta=seq(0,2*pi, length=1000)) %>%
mutate(x = cos(theta),
y = sin(theta),
grp = 'A') %>%
select(-theta)
df.B <- df.A %>% mutate( x = x+1, grp='B')
df.Ac <- rbind(
df.A %>% filter(x <= .5),
df.B %>% filter(x <= .5))
df.AnB <- rbind(
df.A %>% filter(x >= .5),
df.B %>% filter(x <= .5))
df.Bc <- rbind(
df.A %>% filter(x >= .5),
df.B %>% filter(x >= .5))
df.box <- data.frame( x=c(-1.2, 2.2, 2.2, -1.2, -1.2),
y=c(-1.2, -1.2 ,1.2, 1.2, -1.2),
grp='Box' )
df <- rbind(df.A, df.B, df.box)
df.label <- NULL %>%
rbind( data.frame(x= -.75, y= .85, label='A'),
data.frame(x= 1.75, y= -.85, label='B'),
data.frame(x= 2.1, y= 1.1, label='S'))
ggplot(df, aes(x=x, y=y)) +
geom_path( aes(group=grp)) +
geom_polygon(data=df.A, alpha=.4, fill='red') +
geom_polygon(data=df.AnB, alpha=.4, fill='blue') +
scale_x_continuous(breaks=NULL) +
scale_y_continuous(breaks=NULL) +
geom_text( data=df.label, aes(label=label)) +
labs(x='', y='') +
ggtitle(expression( P(A) == P(A*intersect(B^c)) + P(A*intersect(B)) )) +
theme_minimal()
```
This identity is just breaking the event $A$ into two disjoint pieces.
### Conditional Probability
We are given the following data about insurance claims. Notice that the data are given as joint probabilities $P(\;Category\;\cap\;PolicyType\;)$, which is apparent because the sum of all the elements in the table is $100\%$.
| $\,$ | Fire | Auto | Other |
|-------------------:|:------:|:-----:|:--------:|
| **Fraudulent** | 6% | 1% | 3% |
| **non-Fraudulent** | 14% | 29% | 47% |
Summing across the rows and columns, we can find the probabilities for each category and policy type.
| $\,$ | Fire | Auto | Other | $\,$ |
|--------------------:|:--------:|:-------:|:---------:|:----------:|
|**Fraudulent** | 6% | 1% | 3% | **10%** |
| **non-Fraudulent** | 14% | 29% | 47% | **90%** |
| $\,$ | **20%** | **30%** | **50%** | **100%** |
It is clear that fire claims are more likely fraudulent than auto or other claims. In fact, the proportion of fraudulent claims, given that the claim is against a fire policy is
$$\begin{aligned}
P(\textrm{ Fraud }|\textrm{ FirePolicy }) &= \frac{\textrm{proportion of claims that are fire policies and are fraudulent}}{\textrm{proportion of fire claims}} \\
&= \frac{6\%}{20\%}\\
& \\
&= 0.3
\end{aligned}$$
In general we define conditional probability (assuming $P(B) \ne 0$) as
$$P(A|B)=\frac{P(A\cap B)}{P(B)}$$
which can also be rearranged to show
$$\begin{aligned}
P(A\cap B) &= P(A\,|\,B)\,P(B) \\
&= P(B\,|\,A)\,P(A)
\end{aligned}$$
Because the order doesn't matter and $P\left(A\cap B\right)=P\left(B\cap A\right)$.
Using this rule, we might calculate the probability that a claim is an Auto policy given that it is not fraudulent.
$$\begin{aligned}
P\left(\,Auto\;|\;NotFraud\,\right) &= \frac{P\left(\,Auto\;\cap\;NotFraud\right)}{P\left(\,NotFraud\,\right)} \\
&= \frac{0.29}{0.9} \\
& \\
&= 0.3\bar{2}
\end{aligned}$$
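These table calculations are easy to mirror in R. The following is a minimal sketch using base R; the matrix name `claims` is ours, not something defined earlier in the chapter:

```{r}
# Joint probabilities P( Category and PolicyType ) from the insurance table
claims <- matrix( c(0.06, 0.01, 0.03,
                    0.14, 0.29, 0.47),
                  nrow=2, byrow=TRUE,
                  dimnames=list(c('Fraud','NotFraud'), c('Fire','Auto','Other')) )
rowSums(claims)   # marginal probabilities of Fraud / NotFraud
colSums(claims)   # marginal probabilities of Fire / Auto / Other
# Conditional probabilities via P(A|B) = P(A and B) / P(B)
claims['Fraud','Fire']    / colSums(claims)['Fire']       # P( Fraud | Fire )    = 0.30
claims['NotFraud','Auto'] / rowSums(claims)['NotFraud']   # P( Auto | NotFraud ) = 0.322...
```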
**Definition 4**. *Two events $A$ and $B$ are said to be independent if $P(A|B)=P(A)\;\;\textrm{and}\;\;P(B|A)=P(B)$.*
What independence is saying is that knowing the outcome of event $A$ doesn't give you any information about the outcome of event $B$.
* In simple random sampling, we assume that any two samples are independent.
* In cluster sampling, we assume that samples within a cluster are not independent, but clusters are independent of each other.
**Fact 5**. *If $A$ and $B$ are independent events, then $P(A\cap B) = P(A|B)P(B) = P(A)P(B)$.*
**Example 6**. Suppose that we are interested in the relationship between the color and the type of car. Specifically I will divide the car world into convertibles and non-convertibles and the colors into red and non-red.
Suppose that convertibles make up just 10% of the domestic automobile market. This is to say $P(\;Convertible\;)=0.10$. Of the non-convertibles, red is not unheard of but it isn't common either. So suppose $P(\;Red\;|\;NonConvertible\;)=0.15$. However, red is an extremely popular color for convertibles, so let $P(\;Red\;|\;Convertible\;)=0.60$.
Given the above information, we can create the following table:
| $\,$ | Convertible | Not Convertible | $\,$ |
|-------------:|:-------------------:|:------------------:|:----------:|
| **Red** | | | |
| **Not Red** | | | |
| $\,$ | **10%** | **90%** | **100%** |
We can fill in some of the table using the definition of conditional probability. For example:
$$\begin{aligned}
P\left(Red\,\cap\,Convertible\right) &= P\left(Red\,|\,Convertible\right)\,P\left(Convertible\right) \\
&= 0.60*0.10 \\
&= 0.06
\end{aligned}$$
Let's think about what this conditional probability means. Of the $90\%$ of cars that are not convertibles, $15\%$ of those non-convertibles are red, and therefore the proportion of cars that are red non-convertibles is $0.90*0.15=0.135$. Of the $10\%$ of cars that are convertibles, $60\%$ of those are red, and therefore the proportion of cars that are red convertibles is $0.10*0.60=0.06$. Thus the total percentage of red cars is actually
$$\begin{aligned}P\left(\,Red\,\right)
&= P\left(\;Red\;\cap\;Convertible\;\right)+P\left(\,Red\,\cap\,NonConvertible\,\right)\\
&= P\left(\,Red\,|\,Convertible\,\right)P\left(\,Convertible\,\right)+P\left(\,Red\,|\,NonConvertible\,\right)P\left(\,NonConvertible\,\right)\\
&= 0.60*0.10+0.15*0.90\\
&= 0.06+0.135\\
&= 0.195
\end{aligned}$$
So when I ask for $P(\;Red\;|\;Convertible\;)$, I am narrowing my space of cars to consider only convertibles. While the percentage of cars that are red and convertible is just 6% of all cars, when I restrict myself to convertibles, we see that the percentage of this smaller set of cars that are red is 60%.
Notice that because $P\left(Red\right)=0.195\ne0.60=P\left(Red\,|\,Convertible\right)$, the events $Red$ and $Convertible$ are not independent.
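The same total-probability bookkeeping can be written as a few lines of R. This is just a sketch of the arithmetic above; the object names are ours:

```{r}
# Law of total probability: P(Red) = P(Red|Conv)P(Conv) + P(Red|NonConv)P(NonConv)
p.convertible       <- 0.10
p.red.given.conv    <- 0.60
p.red.given.nonconv <- 0.15
p.red <- p.red.given.conv * p.convertible +
         p.red.given.nonconv * (1 - p.convertible)
p.red   # 0.195, which is not equal to P(Red | Convertible) = 0.60
```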
### Summary of Probability Rules
$$0 \le P\left(A\right) \le 1$$
$$P\left(A\right)+P\left(\bar{A}\right)=1$$
$$P\left(A\cup B\right) = P\left(A\right)+P\left(B\right)-P\left(A\cap B\right)$$
$$P\left(A\cap B\right) = \begin{cases}
P\left(A\,|\,B\right)P\left(B\right)\\
P\left(B\,|\,A\right)P\left(A\right)\\
P(A)P(B)\;\; & \textrm{ if A,B are independent}
\end{cases}$$
$$P\left(A\,|\,B\right) = \frac{P\left(A\cap B\right)}{P\left(B\right)}$$
## Discrete Random Variables
The different types of probability distributions (and therefore your analysis method) can be divided into two general classes:
1. Continuous Random Variables - the variable takes on numerical values and could, in principle, take any of an uncountable number of values. In practical terms, if fractions or decimal points in the number make sense, it is usually continuous.
2. Discrete Random Variables - the variable takes on one of a small set of values (or only a countable number of outcomes). In practical terms, if fractions or decimal points don't make sense, it is usually discrete.
Examples:
1. Presence or Absence of wolves in a State?
2. Number of Speeding Tickets received?
3. Tree girth (in cm)?
4. Photosynthesis rate?
### Introduction to Discrete Random Variables
The following facts hold for discrete random variables:
1. The probability associated with every value lies between 0 and 1
2. The sum of all probabilities for all values is equal to 1
3. Probabilities for discrete RVs are additive. i.e., $P(3\textrm{ or }4)=P(3)+P(4)$
#### Expected Value
Example: Consider the discrete random variable $S$, the sum of two fair dice.
```{r, echo=FALSE}
data <- data.frame(x = 2:12, y = c(1:6,5:1)/36 )
ggplot(data, aes(x=x, y=y)) +
geom_point() +
geom_linerange(aes(ymax=y, ymin=0)) +
ylab('Probability') + xlab('Sum') +
ggtitle('Distribution of the sum of two dice') +
scale_x_continuous(breaks = c(2,4,6,8,10,12)) +
theme_bw()
```
We often want to ask 'What is the expected value of this distribution?' You might think about taking a really, really large number of samples from this distribution and then taking the mean of that really, really big sample. We define the expected value (often denoted by $\mu$) as a weighted average of the possible values, where the weights are the proportions with which those values occur.
$$\mu=E[S] = \sum_{\textrm{possible }s}\;s\cdot P\left(S=s\right)$$
In this case, we have that
$$\begin{aligned} \mu = E[S]
&= \sum_{s=2}^{12}s\cdot P(S=s) \\
&= 2\cdot P\left(S=2\right)+3\cdot P\left(S=3\right)+\dots+11\cdot P\left(S=11\right)+12\cdot P\left(S=12\right) \\
&= 2\left(\frac{1}{36}\right)+3\left(\frac{2}{36}\right)+\dots+11\left(\frac{2}{36}\right)+12\left(\frac{1}{36}\right) \\
&= 7
\end{aligned}$$
#### Variance
Similarly we could define the variance of $S$ (which we often denote $\sigma^{2}$) as a weighted average of the squared-deviations that could occur.
$$ \sigma^{2}=V[S] = \sum_{\textrm{possible }s}\; (s-\mu)^2 \cdot P\left(S=s\right)$$
which in this example can be calculated as
$$\begin{aligned} \sigma^{2}=V[S]
&= \sum_{s=2}^{12}\left(s-\mu\right)^{2}P(S=s) \\
&= (2-7)^{2}\left(\frac{1}{36}\right)+(3-7)^{2}\left(\frac{2}{36}\right)+\dots+(12-7)^{2}\left(\frac{1}{36}\right) \\
&= \frac{35}{6}=5.8\bar{3}
\end{aligned}$$
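Both of these weighted averages are easy to compute in R directly from the probability mass function. The data frame below re-creates the pmf of the sum of two dice (the name `dice` is ours):

```{r}
# Expected value and variance of S computed directly from its pmf
dice <- data.frame( s = 2:12, prob = c(1:6, 5:1)/36 )
mu     <- sum( dice$s * dice$prob )              # weighted average of the values: 7
sigma2 <- sum( (dice$s - mu)^2 * dice$prob )     # weighted average of squared deviations: 35/6
c( mean = mu, var = sigma2 )
```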
We could interpret the expectation as the sample mean of an infinitely large sample, and the variance as the sample variance of the same infinitely large sample. These are two very important numbers that describe the distribution.
**Example 7**. My wife is a massage therapist and over the last year, the number of clients she sees per work day (denoted $Y$) varied according to the following table:
| Number of Clients | 0 | 1 | 2 | 3 | 4 |
|---------------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|
| **Frequency/Probability** | 0.30 | 0.35 | 0.20 | 0.10 | 0.05 |
```{r}
distr <- data.frame( clients = c( 0, 1, 2, 3, 4 ), # two columns
probability = c(0.3, 0.35, 0.20, 0.10, 0.05 ) ) #
ggplot(distr, aes(x=clients)) + # graph with clients as the x-axis
geom_point(aes(y=probability)) + # where the dots go
geom_linerange(aes(ymax=probability, ymin=0)) + # the vertical lines
theme_bw() # set background color...
```
Because this is the long term relative frequency of the number of clients (over 200 working days!), it is appropriate to interpret these frequencies as probabilities. This table and graph are often called a probability mass function (pmf) because they list how the probability is spread across the possible values of the random variable. We might next ask ourselves: what is the average number of clients per day? It looks like it ought to be between 1 and 2 clients per day.
$$\begin{aligned} E\left(Y\right)
&= \sum_{\textrm{possible }y}y\,P\left(Y=y\right) \\
&= \sum_{y=0}^{4}y\,P\left(Y=y\right) \\
&= 0\,P\left(Y=0\right)+1\,P\left(Y=1\right)+2\,P\left(Y=2\right)+3\,P\left(Y=3\right)+4\,P\left(Y=4\right) \\
&= 0\left(0.3\right)+1\left(0.35\right)+2\left(0.20\right)+3\left(0.10\right)+4\left(0.05\right) \\
&= 1.25 \end{aligned}$$
Notice that this number is not an integer and therefore is not a value that $Y$ could actually take on. You might therefore be tempted to round it to the nearest integer. That would be wrong. The rationale is that if we wanted to estimate the number of clients she has per month (and thus her income), we would have a worse estimate if we used the rounded number.
Another example of a case where rounding would be inappropriate is in gambling situations where the amount won or lost per hand isn't particularly important but the average amount won or lost over hundreds or thousands of plays is what matters. A Roulette wheel has 18 red and 18 black slots along with 2 green. If you bet $1 on red, you could either win a dollar or lose a dollar. However, because the probabilities are
| | Win ( + $1 ) | Lose (- $1) |
|------------------:|:----------------------:|:-----------------:|
| **Probability** | $\frac{18}{38}$ | $\frac{20}{38}$ |
then the person's expected winnings per play are:
$$ \begin{aligned}E[W]
= \sum_{\textrm{possible }w}w\,P\left(W=w\right)
= 1 \left(\frac{18}{38} \right) + (-1) \left( \frac{20}{38} \right)
= -0.0526 \end{aligned}$$
So for every Black/Red bet, the player should expect to lose about 5.3 cents. While this number is small, it is enough to make the casino millions of dollars over the long run.
Returning to the massage therapy example, assuming that successive days are independent (which might be a bad assumption) what is the probability she has two days in a row with no clients?
$$\begin{aligned}P\left(\textrm{0 on day1 }and\textrm{ 0 on day2}\right)
&= P\left(\textrm{0 on day 1}\right)P\left(\textrm{0 on day 2}\right) \\
&= \left(0.3\right)\left(0.3\right) \\
&= 0.09 \end{aligned}$$
What is the variance of this distribution?
$$\begin{aligned}V\left(Y\right)
&= \sum_{\textrm{possible y}}\,\left(y-\mu\right)^{2}\,P\left(Y=y\right) \\
&= \sum_{y=0}^{4}\,\left(y-\mu\right)^{2}P\left(Y=y\right) \\
&= \left(0-1.25\right)^{2}\left(0.3\right)+\left(1-1.25\right)^{2}\left(0.35\right)+\left(2-1.25\right)^{2}\left(0.20\right)+\left(3-1.25\right)^{2}\left(0.10\right)+\left(4-1.25\right)^{2}\left(0.05\right) \\
&= 1.2875 \end{aligned}$$
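The same weighted-average arithmetic can be done in R using the `distr` data frame defined earlier in this example (a quick sketch):

```{r}
# E[Y] and V[Y] computed from the client pmf
mu     <- sum( distr$clients * distr$probability )               # 1.25
sigma2 <- sum( (distr$clients - mu)^2 * distr$probability )      # 1.2875
c( mean = mu, var = sigma2 )
```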
Note on Notation: There is a difference between the upper and lower case letters we have been using to denote a random variable. In general, we let the upper case denote the random variable and the lower case denote a value that the variable could possibly take on. So in the massage example, the number of clients seen per day $Y$ could take on values $y=0,1,2,3,$ or $4$.
## Common Discrete Distributions
### Binomial Distribution
**Example**: Suppose we are trapping small mammals in the desert and we spread out three traps. Assume that the traps are far enough apart that having one being filled doesn't affect the probability of the others being filled and that all three traps have the same probability of being filled in an evening. Denote the event that a trap is filled with a critter as $C_{i}$ and denote the event that the trap is empty as $E_{i}$. Denote the probability that a trap is filled by $\pi=0.8$. (This sort of random variable is often referred to as a Bernoulli RV.)
The possible outcomes are
| Outcome | $\,$ |
|:------------------:|:------:|
| $E_1, E_2, E_3$ | $\,$ |
| $C_1, E_2, E_3$ | $\,$ |
| $E_1, C_2, E_3$ | $\,$ |
| $E_1, E_2, C_3$ | $\,$ |
| $C_1, C_2, E_3$ | $\,$ |
| $C_1, E_2, C_3$ | $\,$ |
| $E_1, C_2, C_3$ | $\,$ |
| $C_1, C_2, C_3$ | $\,$ |
Because the traps are far enough apart in space that the outcome of Trap 1 is independent of Trap 2 and Trap 3, then
$$P(E_{1}\cap C_{2}\cap E_{3}) = P(E_{1})P(C_{2})P(E_{3})
= (1-0.8)0.8(1-0.8)
= 0.032$$
**Notice how important the assumption of independence is!!!** Similarly we could calculate the probabilities for the rest of the table.
| Outcome             | Probability     | $S$ Outcome   | Probability          |
|:-------------------:|:---------------:|:-------------:|:--------------------:|
| $E_1, E_2, E_3$     | 0.008           | $S=0$         | 0.008                |
| $C_1, E_2, E_3$     | 0.032           |               |                      |
| $E_1, C_2, E_3$     | 0.032           | $S=1$         | $3(0.032) = 0.096$   |
| $E_1, E_2, C_3$     | 0.032           |               |                      |
| $C_1, C_2, E_3$     | 0.128           |               |                      |
| $C_1, E_2, C_3$     | 0.128           | $S=2$         | $3(0.128) = 0.384$   |
| $E_1, C_2, C_3$     | 0.128           |               |                      |
| $C_1, C_2, C_3$     | 0.512           | $S=3$         | $0.512$              |
Next we are interested in the random variable $S$, the number of traps that were filled:
| $S$ Outcome | Probability |
|:-------------:|:---------------:|
| $S=0$ | $0.008$ |
| $S=1$ | $0.096$ |
| $S=2$ | $0.384$ |
| $S=3$ | $0.512$ |
$S$ is an example of a Binomial Random Variable. A binomial experiment is one that:
1. The experiment consists of $n$ identical trials.
2. Each trial results in one of two outcomes (Heads/Tails, presence/absence). One will be labeled a success and the other a failure.
3. The probability of success on a single trial is equal to $\pi$ and remains the same from trial to trial.
4. The trials are independent (this is implied from property 3).
5. The random variable $Y$ is the number of successes observed during $n$ trials.
Recall that the probability mass function (pmf) describes how the probability is spread across the possible outcomes, and in this case, I can describe it via a nice formula. The pmf of a binomial random variable $X$ taken from $n$ trials each with probability of success $\pi$ is
$$P(X=x)=\underbrace{\frac{n!}{x!(n-x)!}}_{orderings}\;\underbrace{\pi^{x}}_{x\,successes}\;\underbrace{(1-\pi)^{n-x}}_{n-x\,failures}$$
where we define $n!=n(n-1)\dots(2)(1)$ and further define $0!=1$. Often the ordering term is written more compactly as $${n \choose x}=\frac{n!}{x!\left(n-x\right)!}$$.
For our small mammal example we can create a graph that shows the binomial distribution with the following R code:
```{r}
dist <- data.frame( x=0:3 ) %>%
mutate(probability = dbinom(x, size=3, prob=0.8))
ggplot(dist, aes(x=x)) +
geom_point(aes(y=probability)) +
geom_linerange(aes(ymax=probability, ymin=0)) +
ggtitle('Binomial distribution: n=3, p=0.8') +
theme_bw()
```
To calculate the height of any of these bars, we can evaluate the pmf at the desired point. For example, to calculate the probability the number of full traps is 2, we calculate the following
$$\begin{aligned} P(X=2)
&= {3 \choose 2}\left(0.8\right)^{2}\left(1-0.8\right)^{3-2} \\
&= \frac{3!}{2!(3-2)!}(0.8)^{2}(0.2)^{3-2} \\
&= \frac{3\cdot2\cdot1}{(2\cdot1)1}\;(0.8)^{2}(0.2) \\
&= 3(0.128) \\
&= 0.384 \end{aligned}$$
You can use R to calculate these probabilities. In general, for any distribution, the “d-function” gives the probability mass function (or probability density function). So to get R to do the preceding calculation we use:
```{r}
# If X ~ Binomial(n=3, pi=0.8)
# Then P( X = 2 | n=3, pi=0.8 ) =
dbinom(2, size=3, prob=0.8)
```
The expectation of this distribution can be shown to be
$$\begin{aligned}E[X]
&= \sum_{x=0}^{n}x\,P(X=x) \\
&= \sum_{x=0}^{n}x\;\frac{n!}{x!\left(n-x\right)!}\pi^{x}\left(1-\pi\right)^{n-x}\\
&= \vdots \\
&= n\pi \end{aligned}$$
and the variance can be similarly calculated
$$\begin{aligned} V[X]
&= \sum_{x=0}^{n}\left(x-E\left[X\right]\right)^{2}\,P\left(X=x|n,\pi\right) \\
&= \sum_{x=0}^{n}\left(x-E\left[X\right]\right)^{2}\;\frac{n!}{x!\left(n-x\right)!}\pi^{x}\left(1-\pi\right)^{n-x} \\
&= \vdots \\
&= n\pi(1-\pi) \end{aligned}$$
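One informal way to convince yourself of these formulas (without wading through the algebra hidden in the $\vdots$) is simulation. The sketch below simulates many draws from the trapping example and compares the sample mean and variance to $n\pi=2.4$ and $n\pi(1-\pi)=0.48$; the seed is arbitrary:

```{r}
# Simulation check of E[X] = n*pi and V[X] = n*pi*(1-pi) for n=3, pi=0.8
set.seed(76)                              # arbitrary seed so this sketch is reproducible
draws <- rbinom(10000, size=3, prob=0.8)  # 10,000 simulated values of X
mean(draws)                               # should be close to 2.4
var(draws)                                # should be close to 0.48
```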
**Example 8**. Suppose a bird survey only captures the presence or absence of a particular bird (say the mountain chickadee). Assuming the true presence proportion at national forest sites around Flagstaff is $\pi=0.1$, then for $n=20$ randomly chosen sites, the number of sites at which the bird was observed would have the distribution
```{r}
dist <- data.frame( x = 0:20 ) %>%
mutate(probability = dbinom(x, size=20, prob=0.1))
ggplot(dist, aes(x=x)) +
geom_point(aes(y=probability)) +
geom_linerange(aes(ymax=probability, ymin=0)) +
ggtitle('Binomial distribution: n=20, p=0.1') +
xlab('Number of Sites Occupied') +
theme_bw()
```
Often we are interested in questions such as $P(X\le2)$, which is the probability that we see 2 or fewer of the sites occupied by the mountain chickadee. These probabilities can be tedious to calculate by hand, but R will calculate the cumulative distribution function values for you using the “p-function”. The cumulative distribution function gives the sum of all probabilities up to and including the value given.
```{r}
# P(X=0) + P(X=1) + P(X=2)
sum <- dbinom(0, size=20, prob=0.1) +
dbinom(1, size=20, prob=0.1) +
dbinom(2, size=20, prob=0.1)
sum
# P(X <= 2)
pbinom(2, size=20, prob=0.1)
```
In general we will be interested in asking four different questions about a distribution.
1. What is the height of the probability mass function (or probability density function)? For a discrete variable $Y$ this is $P\left(Y=y\right)$ for whatever value of $y$ we want. In R, this will be the `d`-function.
2. What is the probability of observing a value less than or equal to $y$? In other words, to calculate $P\left(Y\le y\right)$. In R, this will be the `p`-function.
3. What is a particular quantile of a distribution? For example, what value separates the lower $25\%$ from the upper $75\%$? In R, this will be the `q`-function.
4. Generate a random sample of values from a specified distribution. In R, this will be the `r`-function.
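All four questions can be answered with the corresponding R functions. The sketch below uses the Binomial($n=20$, $\pi=0.1$) distribution from the chickadee example:

```{r}
dbinom(2, size=20, prob=0.1)     # 1. height of the pmf at y=2, i.e. P(Y = 2)
pbinom(2, size=20, prob=0.1)     # 2. cumulative probability P(Y <= 2)
qbinom(0.25, size=20, prob=0.1)  # 3. the 0.25 quantile (lower 25%) of the distribution
rbinom(5, size=20, prob=0.1)     # 4. a random sample of 5 values from the distribution
```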
### Poisson Distribution
A commonly used distribution for count data is the Poisson. Examples include:
1. Number of customers arriving over a 5 minute interval
2. Number of birds observed during a 10 minute listening period
3. Number of prairie dog towns per 1000 hectares
4. Number of algae clumps per cubic meter of lake water
A random variable $Y$ is a Poisson RV if the following conditions apply:
1. Two or more events do not occur at precisely the same time or in the same space
2. The occurrence of an event in a given period of time or region of space is independent of the occurrence of the event in a non-overlapping period or region.
3. The expected number of events during one period or region, $\lambda$, is the same in all periods or regions of the same size.
Assuming that these conditions hold for some count variable $Y$, then the probability mass function is given by
$$P(Y=y)=\frac{\lambda^{y}e^{-\lambda}}{y!}$$
where $\lambda$ is the expected number of events over 1 unit of time or space and $e$ is the constant $2.718281828\dots$.
$$E[Y] = \lambda$$
$$Var[Y] = \lambda$$
**Example 9**. Suppose we are interested in the population size of small mammals in a region. Let $Y$ be the number of small mammals caught in a large trap (multiple traps in the same location?) in a 12 hour period. Finally, suppose that $Y\sim Poi(\lambda=2.3)$. What is the probability of finding exactly 4 critters in our trap?
$$P(Y=4) = \frac{2.3^{4}\,e^{-2.3}}{4!} = 0.1169$$
What about the probability of finding at most 4?
$$\begin{aligned} P(Y\le4)
&= P(Y=0)+P(Y=1)+P(Y=2)+P(Y=3)+P(Y=4) \\
&= 0.1003+0.2306+0.2652+0.2033+0.1169 \\
&= 0.9163 \end{aligned}$$
What about the probability of finding 5 or more?
$$P(Y\ge5) = 1-P(Y\le4) = 1-0.9163 = 0.0837$$
These calculations can be done using the distribution function (d-function) for the Poisson and the cumulative distribution function (p-function).
```{r}
dist <- data.frame( NumCaught = 0:10 ) %>%
mutate( probability = dpois( NumCaught, lambda=2.3 ) )
ggplot(dist, aes(x=NumCaught)) +
geom_point( aes(y=probability) ) +
geom_linerange(aes( ymax=probability, ymin=0)) +
ggtitle(expression(paste('Poisson Distribution with ', lambda == 2.3))) +
labs(x='Number Caught') +
theme_bw()
# P( Y = 4)
dpois(4, lambda=2.3)
# P( Y <= 4)
ppois(4, lambda=2.3)
# 1-P(Y <= 4) == P( Y > 4) == P( Y >= 5)
1-ppois(4, 2.3)
```
## Continuous Random Variables
Continuous random variables can take on an (uncountably) infinite number of values, and this results in a few obnoxious mathematical differences between how we handle continuous and discrete random variables. In particular, the probability that a continuous random variable $X$ will take on a particular value will be zero, so we will be interested in finding the probability that the random variable is in some interval instead. Wherever we had a summation, $\sum$, we will instead have an integral, but because many students haven't had calculus, we will resort to using R or tables of calculated values.
### Uniform(0,1) Distribution
Suppose you wish to draw a random number between 0 and 1 such that any two intervals of equal size have the same probability of containing the value. This random variable is said to have a Uniform(0,1) distribution.
Because there are an infinite number of possible values between 0 and 1, the probability of any particular number being selected is $1/\infty=0$. But even though each number has 0 probability of being selected, some number must end up being selected. Because of this conundrum, probability theory doesn't look at the probability of a single number, but rather focuses on a region of numbers.
To make this distinction, we will define the distribution using a probability density function (pdf) instead of the probability mass function. In the discrete case, we had to constrain the probability mass function to sum to 1. In the continuous case, we have to constrain the probability density function to integrate to 1.
```{r, echo=FALSE}
data <- data.frame(x = seq(-.2, 1.2, length=1000)) %>%
mutate( y = dunif(x))
ggplot(data, aes(x=x, y=y)) +
geom_line() +
labs(x='X', y='density', title='PDF of Uniform(0,1)') +
geom_text(x=.5, y=.5, label='Area = 1') +
theme_bw()
```
```{r, echo=FALSE}
data <- data.frame(x = seq(-.2, 1.2, length=1000)) %>%
mutate( y = dunif(x))
ggplot(data, aes(x=x, y=y)) +
geom_line() +
geom_area(data=data %>% filter(x < .4), fill='grey', alpha=.4) +
labs(x='X', y='density', title='PDF of Uniform(0,1)') +
geom_text(x=.2, y=.5, label='P(X<=0.4) == 0.4', parse=TRUE) +
theme_bw()
```
Finding the area under the curve of a particular density function $f(x)$ usually requires the use of calculus, but since this isn't a calculus course, we will resort to using R or tables of calculated values.
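For the uniform distribution these areas are simple enough that R's functions can confirm the pictures above. A short sketch:

```{r}
punif(0.4, min=0, max=1)                              # P(X <= 0.4) = 0.4
punif(0.7, min=0, max=1) - punif(0.2, min=0, max=1)   # P(0.2 <= X <= 0.7) = 0.5
runif(3, min=0, max=1)                                # three random draws from Uniform(0,1)
```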
### Exponential Distribution
The exponential distribution is the continuous analog of the Poisson distribution and is often used to model the waiting time between successive events. Perhaps we are modeling time between transmissions on a network, or the time between feeding events or prey capture. If the random variable $X$ has an Exponential distribution, its probability density function is
$$f(x)=\begin{cases}
\lambda e^{-\lambda x} & x\ge0\;\textrm{ and }\;\lambda>0\\
0 & \textrm{otherwise}
\end{cases}$$
Analogous to the discrete distributions, we can define the Expectation and Variance of these distributions by replacing the summation with an integral
$$\mu = E[X] = \int_{0}^{\infty}x\,f(x)\,dx = \dots = \frac{1}{\lambda} $$
$$\sigma^2 = Var[X] = \int_{0}^{\infty}\left(x-\mu\right)^{2}\,f\left(x\right)\,dx = \dots = \frac{1}{\lambda^{2}}$$
Because the exponential distribution is defined by the rate of occurrence of an event, increasing that rate decreases the time between events. Furthermore because the rate of occurrence cannot be negative, we restrict $\lambda>0$.
```{r, echo=FALSE}
data <- expand.grid(x=seq(0,10,length=1000), lambda = c(.5, 1, 2)) %>%
mutate(y=dexp(x, rate = lambda),
lambda = paste('lambda == ', lambda))
ggplot(data, aes(x=x, y=y)) +
geom_area(fill='grey') +
facet_grid(lambda~., labeller = labeller(lambda=label_parsed)) +
labs(y='density') +
theme_bw()
```
**Example 10**. Suppose the time between insect captures $X$ during a summer evening for a species of bat follows an exponential distribution with a capture rate of $\lambda=2$ insects per minute, and therefore the expected waiting time between captures is $1/\lambda=1/2$ minute. Suppose that we are interested in the probability that it takes a bat more than 1 minute to capture its next insect; that is, we want
$$P(X>1)$$
```{r}
data <- data.frame(x=seq(0,5,length=1000), lambda = 2) %>%
mutate(y=dexp(x, rate = lambda),
grp = ifelse( x > 1, '> 1', '<= 1'))
ggplot(data, aes(x=x, y=y, fill=grp)) +
geom_area() +
labs(y='density') +
theme_bw()
```
We now must resort to calculus to find this area. Or use tables of pre-calculated values. Or use R, remembering that p-functions give the area under the curve to the left of the given value.
```{r}
# P(X > 1) == 1 - P(X <= 1)
1 - pexp(1, rate=2)
```
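For the exponential distribution the upper-tail area also has a simple closed form, $P(X>x)=e^{-\lambda x}$, so we can double-check R's answer:

```{r}
# P(X > 1) = exp(-lambda * x) with lambda = 2 and x = 1
exp(-2 * 1)    # matches 1 - pexp(1, rate=2)
```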
### Normal Distribution
Undoubtedly the most important distribution in statistics is the normal distribution. If my RV $X$ is normally distributed with mean $\mu$ and standard deviation $\sigma$, its probability density function is given by
$$f(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left[\frac{-(x-\mu)^{2}}{2\sigma^{2}}\right]$$
where $\exp[y]$ is the exponential function $e^{y}$. We could slightly rearrange the function to
$$f(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]$$
and see that this distribution is defined by its expectation $E[X]=\mu$ and its variance $Var[X]=\sigma^{2}$. Notice I could equally well define it using the standard deviation $\sigma$, and different software packages expect one parameterization or the other. R defines the normal distribution using the standard deviation.
```{r, echo=FALSE}
data <- rbind(data.frame(x=seq(-5,8,length=2000), grp=1, mu=0, sigma=1),
data.frame(x=seq(-5,8,length=2000), grp=2, mu=2, sigma=2),
data.frame(x=seq(-5,8,length=2000), grp=3, mu=-3, sigma=0.6)) %>%
mutate(y=dnorm(x, mean=mu, sd=sigma),
mu = paste('mu == ', mu),
sigma = paste('sigma ==', sigma))
label.df <- data.frame( grp=rep(1:3, each=2), label=c('mu == 0','sigma == 1', 'mu == 2','sigma == 2', 'mu==-3','sigma == 0.6'),
x = c(0,0, 4.8,4.8, -4.15,-4.15), y= c(.46,.43, .16,.13, .56,.53))
ggplot(data, aes(x=x, y=y, color=factor(grp))) +
geom_line() +
geom_text(data=label.df, aes(label=label), parse=TRUE) +
labs(y='density') +
theme_bw() +
scale_color_discrete(guide=FALSE)
```
**Example 11**. It is known that the heights of adult males in the US is approximately normal with a mean of 5 feet 10 inches ($\mu=70$ inches) and a standard deviation of $\sigma=3$ inches. Your instructor is a mere 5 feet 4 inches (64 inches). What proportion of the population is shorter than your professor?
```{r}
distr <- data.frame(x=seq(57, 82, length=1000)) %>%
mutate( density = dnorm(x, mean=70, sd=3),
group = ifelse(x<=64, 'Shorter','Taller') )
ggplot(distr, aes(x=x, y=density, fill=group)) +
geom_line() +
geom_area() +
theme_bw()
```
Using R you can easily find this probability:
```{r}
pnorm(64, mean=70, sd=3)
```
### Standardizing
Before we had computers that could calculate these probabilities for any normal distribution, it was important to know how to convert a probability statement from an arbitrary $N\left(\mu,\sigma^{2}\right)$ distribution to a question about a Standard Normal distribution, which is a normal distribution with mean $\mu=0$ and standard deviation $\sigma=1$. If we have
$$X\sim N\left(\mu,\sigma^{2}\right)$$
then
$$Z=\frac{X-\mu}{\sigma}\sim N\left(0,1\right)$$
You might remember doing something similar in an undergraduate statistics course in order to use a table to look up some probability. From the height example, we calculate
$$\begin{aligned}z
&= \frac{64-70}{3} \\
&= \frac{-6}{3} \\
&= -2 \end{aligned}$$
Note that this calculation shows that he is $-2$ standard deviations from the mean. Next we look at a table for $z=-2.00$. To do this we go down to the $-2.0$ row and over to the $.00$ column and find $0.0228$. Only slightly over 2% of the adult male population is shorter!
How tall must a person be to be taller than $80\%$ of the rest of the adult male population? To answer that we must use the table in reverse and look for the $0.8$ value. We find the closest value possible $(0.7995)$ and the $z$ value associated with it is $z=0.84$. Next we solve the standardizing equation for $x$
$$\begin{aligned}
z &= \frac{x-\mu}{\sigma} \\
0.84 &= \frac{x-70}{3} \\
x &= 3(0.84)+70 \\
&= 72.52\;\textrm{inches} \end{aligned}$$
Alternatively we could use the quantile function for the normal distribution (q-function) in R and avoid the imprecision of using a table.
```{r}
qnorm(.8, mean=0, sd=1)
```
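We could also ask R for the quantile on the original height scale directly, which avoids un-standardizing by hand:

```{r}
qnorm(0.8, mean=70, sd=3)          # approximately 72.52 inches
70 + 3 * qnorm(0.8, mean=0, sd=1)  # equivalently, un-standardize the z quantile ourselves
```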
**Empirical Rule** - It is from the normal distribution that the empirical rule from the previous chapter is derived. If $X\sim N(\mu,\sigma^{2})$ then
$$\begin{aligned} P(\mu-\sigma\le X\le\mu+\sigma)
&= P(-1 \le Z \le 1) \\
&= P(Z \le 1) - P(Z \le -1) \\
&\approx 0.8413-0.1587 \\
&= 0.6826 \end{aligned}$$
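These percentages (along with the 95% and 99.7% values shown in the figure below) can be computed directly from the standard normal cumulative distribution function:

```{r}
pnorm(1) - pnorm(-1)   # about 0.68
pnorm(2) - pnorm(-2)   # about 0.95
pnorm(3) - pnorm(-3)   # about 0.997
```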
```{r, echo=FALSE}
data <- data.frame(x=seq(-3.5,3.5,by=.1)) %>%
mutate(y=dnorm(x))
data <- rbind( data %>% mutate(sd=1),
data %>% mutate(sd=2),
data %>% mutate(sd=3)) %>%
mutate(CI = ifelse( -sd <= x & x <= sd, 'In', 'Out'),
label = ifelse( sd == 1, '68%', ifelse(sd == 2, '95%', '99.7%')),
sd = paste('mu %+-%', sd, '* sigma'))
ggplot(data, aes(x=x,y=y)) +
geom_line() +
facet_grid(sd ~ ., labeller=label_parsed) +
geom_area(data=data %>% filter(CI=='In'), fill='grey') +
geom_text(x=0, y=.2, aes(label=label)) +
scale_x_continuous(breaks=-3:3, labels=parse(text=paste(-3:3, '*sigma')) ) +
labs(x='', y='density') +
theme_bw()