---
output:
pdf_document: default
html_document: default
---
```{r, echo=FALSE}
# Unattach any packages that happen to already be loaded. In general this is unnecessary
# but is important for the creation of the book to not have package namespaces
# fighting unexpectedly.
pkgs = names(sessionInfo()$otherPkgs)
if( length(pkgs) > 0 ){
  pkgs = paste('package:', pkgs, sep = "")
  for( i in 1:length(pkgs)){
    detach(pkgs[i], character.only = TRUE, force=TRUE)
  }
}
# Set my default chunk options
knitr::opts_chunk$set(
fig.height=3, # Default figure size is 3 inches
cache=TRUE, # Cache the R code Run to Run
tidy=FALSE, # display code as typed
size="small") # slightly smaller font for code
```
# Summary Statistics and Graphing
When confronted with a large amount of data, we seek to summarize the data into statistics that somehow capture the essence of the data with as few numbers as possible. Graphing the data has a similar goal... to reduce the data to an image that represents all the key aspects of the raw data. In short, we seek to simplify the data in order to understand the trends while not obscuring important structure.
```{r, message=FALSE, warning=FALSE}
# Every chapter, we will load all the libraries we will use at the beginning
# of the chapter. These commands will start almost every homework assignment
# for this class, and likely, every R script you write.
library(ggplot2) # graphing functions
library(dplyr) # data summary tools
# Set default behavior of ggplot2 graphs to be black/white theme
theme_set(theme_bw())
```
For this chapter, we will consider data from the 2005 Cherry Blossom 10 mile run that occurs in Washington DC. This data set has 8636 observations that include the runner's state of residence, official time (gun to finish, in seconds), net time (start line to finish, in seconds), age, and sex.
```{r}
data( TenMileRace, package='mosaicData')
head(TenMileRace) # examine the first few rows of the data
```
In general, I often need to make a distinction between two types of data.
* Discrete (also called Categorical) data is data that can only take a small set of particular values. For example a college student's grade can be either A, B, C, D, or F. A person's sex can be only Male or Female [^1]. Discrete data could also be numeric, for example a bird could lay 1, 2, 3, ... eggs in a breeding season.
[^1]: Actually this isn't true as both gender and sex are far more complex. However from a statistical point of view it is often useful to simplify our model of the world. George Box famously said, “All models are wrong, but some are useful.”
* Continuous data is data that can take on an infinite number of numerical values. For example a person's height could be 68 inches, 68.2 inches, 68.23212 inches.
To decide if a data attribute is discrete or continuous, I often ask “Does a fraction of a value make sense?” If so, then the data is continuous.
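A quick way to check how R is storing each column of a data set (and therefore whether R will treat it as discrete or continuous) is the `str()` function:
```{r}
# Factor columns (e.g. state, sex) are discrete;
# numeric columns (e.g. times, age) are usually continuous.
str( TenMileRace )
```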
## Graphical summaries of data
### Univariate - Categorical
If we have univariate data about a number of groups, often the best way to display it is using barplots. They have the advantage over pie-charts that groups are easily compared.
```{r}
ggplot(TenMileRace, aes(x=sex)) + geom_bar()
```
One thing that can be misleading is if the zero on the y-axis is removed. In the following graph it looks like there are twice as many female runners as male until you examine the y-axis closely. In general, the following is a very misleading graph.
```{r}
ggplot(TenMileRace, aes(x=sex)) +
geom_bar() +
coord_cartesian(ylim = c(4300, 4330))
```
### Univariate - Continuous
A histogram looks very similar to a bar plot, but is used to represent continuous data instead of categorical and therefore the bars will actually be touching.
```{r, warning=FALSE, message=FALSE}
ggplot(TenMileRace, aes(x=net)) + geom_histogram()
```
Often when a histogram is presented, the y-axis is labeled as “frequency” or “count” which is the number of observations that fall within a particular bin. However, it is often desirable to scale the y-axis so that if we were to sum up the area $(height * width)$ then the total area would sum to 1. The re-scaling that accomplishes this is $$density=\frac{\#\;observations\;in\;bin}{total\;number\;observations}\cdot\frac{1}{bin\;width}$$
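In ggplot2 this density rescaling can be requested by mapping the y aesthetic to the computed density statistic; a minimal sketch:
```{r, warning=FALSE, message=FALSE}
# Same histogram, but the y-axis is scaled so the total bar area is 1
ggplot(TenMileRace, aes(x=net)) +
  geom_histogram( aes(y = ..density..) )
```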
### Bivariate - Categorical vs Continuous
We often wish to compare response levels from two or more groups of interest. To do this, we often use side-by-side boxplots. Notice that each observation is associated with a continuous response value and a categorical value.
```{r}
ggplot(TenMileRace, aes(x=sex, y=net)) + geom_boxplot()
```
In this graph, the edges of the box are defined by the 25th and 75th percentiles. That is to say, 25% of the data is below the box, 50% of the data is in the box, and the final 25% of the data is above the box. The line in the center of the box represents the 50th percentile (the median). The dots are data points that are traditionally considered outliers. We will define the Inter-Quartile Range (IQR) as the length of the box. It is conventional to regard any observation more than 1.5*IQR from the box as an outlier. In the above graph it is easy to see that the median time for the males is lower than for the females, but the box length (one measure of the spread of the data) is approximately the same.
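As a sketch of this outlier rule, we can compute the fences directly using the `quantile()` function (formally introduced later in this chapter); here we pool both sexes for simplicity:
```{r}
Q1 <- quantile( TenMileRace$net, 0.25 )  # lower edge of the box
Q3 <- quantile( TenMileRace$net, 0.75 )  # upper edge of the box
IQR <- Q3 - Q1                           # length of the box
c( Q1 - 1.5*IQR, Q3 + 1.5*IQR )          # observations outside these fences are flagged
```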
Because boxplots simplify the distribution to just 5 numbers, looking at side-by-side histograms might give similar information.
```{r, warning=FALSE, message=FALSE}
ggplot(TenMileRace, aes(x=net)) +
geom_histogram() +
facet_grid( . ~ sex ) # side-by-side plots based on sex
```
Orientation of graphs can certainly matter. In this case, it makes sense to stack the two graphs to facilitate comparisons of where the centers are; it is now more obvious that the center of the female distribution is about 500 to 600 seconds higher than the center of the male distribution.
```{r, warning=FALSE, message=FALSE}
ggplot(TenMileRace, aes(x=net)) +
geom_histogram() +
facet_grid( sex ~ . ) # stacked plots based on sex
```
### Bivariate - Continuous vs Continuous
Finally we might want to examine the relationship between two continuous random variables. For example, we might wish to explore the relationship between a runner's age and their net time.
```{r}
ggplot(TenMileRace, aes(x=age, y=net, color=sex)) +
geom_point()
```
## Measures of Centrality
The most basic question to ask of any dataset is 'What is the typical value?' There are several ways to answer that question and they should be familiar to most students.
### Mean
Often called the average, or arithmetic mean, we will denote this special statistic with a bar. We define $$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}=\frac{1}{n}\left(x_{1}+x_{2}+\dots+x_{n}\right)$$
If we want to find the mean of five numbers $\left\{ 3,6,4,8,2\right\}$ the calculation is
$$\bar{x} = \frac{1}{5}\left(3+6+4+8+2\right)
= \frac{1}{5}\left(23\right)
= 23/5
= 4.6$$
This can easily be calculated in R by using the function `mean()`. We first extract the column we are interested in using the notation `DataSet$ColumnName`, where the `$` signifies grabbing the column.
```{r}
# TenMileRace$net is the set of data to calculate the mean of.
mean( TenMileRace$net ) # Simplest way of doing this calculation
# Using the dplyr package we first specify the data set
# Then specify we wish to summarize() the data set
# The summary we want to do is to calculate the mean of the 'net' column.
# and we want to name what we are about to create as Calculated.Mean
TenMileRace %>% summarise( Calculated.Mean = mean(net) )
```
### Median
If the data were to be ordered, the median would be the middle most observation (or, in the case that $n$ is even, the mean of the two middle most values).
In our simple case of five observations $\left\{ 3,6,4,8,2\right\}$, we first sort the data into $\left\{ 2,3,4,6,8\right\}$ and then the middle observation is clearly $4$.
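The sorting step itself can be done with R's `sort()` function:
```{r}
sort( c(3,6,4,8,2) )   # the middle observation is clearly 4
```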
In R the median is easily calculated by the function `median()`.
```{r}
# Use the median() function to calculate the median for us.
# median( TenMileRace$net )
TenMileRace %>% summarise( Median = median(net) )
```
### Mode
This is the peak of the distribution. A distribution might have a single peak or multiple peaks. This measure of “center” is not often used in quantitative analyses, but it often helps provide a nice description.
When creating a histogram from a set of data, often the choice of binwidth will affect the modes of the graph. Consider the following graphs of $n=200$ data points, where we have slightly different binwidths.
```{r, echo=FALSE}
set.seed(2398)  # fix the random seed so the sample (and therefore the graphs) are reproducible
data <- TenMileRace %>% sample_n(200)
P1 <- ggplot(data, aes(x=net)) + geom_histogram(binwidth=120) + ggtitle('Binwidth: 120 sec')
P2 <- ggplot(data, aes(x=net)) + geom_histogram(binwidth=240) + ggtitle('Binwidth: 240 sec')
P3 <- ggplot(data, aes(x=net)) + geom_histogram(binwidth=600) + ggtitle('Binwidth: 600 sec')
Rmisc::multiplot(P1, P2, P3, cols = 3)
```
With the two smaller binwidths, sample randomness between adjacent bins obscures the overall shape and we have many different modes. However, the *larger* binwidth results in a histogram that more effectively communicates the shape of the distribution and has just a single mode at around 6000 seconds (= 100 minutes = 1 hour 40 minutes). When making histograms, the choice of binwidth (or equivalently, the number of bins) should not be ignored and a balance should be struck between simplifying the data too much and seeing too much of the noise resulting from the sample randomness.
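In your own plots, the binwidth (or alternatively the number of bins) is set directly in `geom_histogram()`; for example:
```{r, warning=FALSE, message=FALSE}
# Bars that are 600 seconds (10 minutes) wide
ggplot(TenMileRace, aes(x=net)) +
  geom_histogram( binwidth=600 )
```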
### Examples
* Suppose a retired professor (my father) were to become bored and enroll in the author's STA 570 course. How would that affect the mean and median age of the STA 570 students? (We verify the arithmetic in R after these examples.)
    + The mean would move much more than the median. Suppose the class has 5 people right now, ages 21, 22, 23, 23, 24, and therefore the median is 23. When the retired professor joins, the ages will be 21, 22, 23, 23, 24, 72 and the median will remain 23. However, the mean would move because we add in such a large outlier. Whenever we are dealing with skewed data, the mean is pulled toward the outlying observations.
* In 2010 during a players' strike in the NFL, the median NFL player salary was $770,000 while the mean salary was $1.9 million. Clearly the players' union would talk about the median while the team owners preferred to talk about the mean. Why is there such a difference?
+ Because salary data contains outliers (e.g. superstar players with salaries in excess of 20 million) and the minimum salary for a rookie is $375,000. Financial data often reflects a highly skewed distribution and the median is often a better measure of centrality in these cases.
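The professor example above is easy to verify in R:
```{r}
Ages  <- c(21, 22, 23, 23, 24)      # the original class
c( mean=mean(Ages),  median=median(Ages)  )
Ages2 <- c(Ages, 72)                # after the retired professor joins
c( mean=mean(Ages2), median=median(Ages2) )
```
The median stays at 23 while the mean jumps from 22.6 to about 30.8.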
## Measures of Spread
The second question to ask of a dataset is 'How much spread is in the data?' The fancier (and eventually more technical) word for spread is 'variability'. As with centrality, there are several ways to measure this.
### Range
Range is the distance from the largest to the smallest value in the dataset.
```{r}
# max( TenMileRace$net ) - min( TenMileRace$net )
TenMileRace %>% summarise( Range = max(net) - min(net) )
```
In general, this method is highly sensitive to outlier observations and isn't often used.
### Inter-Quartile Range
The $p$th percentile is the observation (or observations) that has at most $p$ percent of the observations below it and $(100-p)$ percent above it, where $p$ is between 0 and 100. The median is the $50$th percentile. Often we are interested in splitting the data into four equal sections using the $25$th, $50$th, and $75$th percentiles (which, because they split the data into four sections, we often call the $1$st, $2$nd, and $3$rd quartiles).
In general we could be interested in dividing the data up into an arbitrary number of sections, and we refer to those as quantiles of the data.
```{r}
quantile( TenMileRace$net ) # gives the 5-number summary by default
```
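Quantiles other than the default quartiles are obtained via the `probs` argument:
```{r}
# e.g. the 10th and 90th percentiles of the net times
quantile( TenMileRace$net, probs=c(0.10, 0.90) )
```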
The inter-quartile range (IQR) is defined as the distance from the $3$rd quartile to the $1$st.
```{r}
# IQR( TenMileRace$net )
TenMileRace %>% summarise( CalcIQR = IQR(net) )
```
Notice that we've defined IQR before when we looked at box-and-whisker plots and this is exactly the length of the box part of a box-and-whisker plot.
### Variance
One way to measure the spread of a distribution is to ask “what is the typical distance of an observation to the mean?” We could define the $i$th deviate as
$$e_{i}=x_{i}-\bar{x}$$
and then ask what is the average deviate? The problem with this approach is that the sum (and thus the average) of all deviates is always 0.
$$\sum_{i=1}^{n}(x_{i}-\bar{x}) = \sum_{i=1}^{n}x_{i}-\sum_{i=1}^{n}\bar{x}
= n\frac{1}{n}\sum_{i=1}^{n}x_{i}-n\bar{x}
= n\bar{x}-n\bar{x}
= 0$$
The big problem is that about half the deviates are negative and the others are positive. What we really care about is the distance from the mean, not the sign. So we could either take the absolute value, or square it.
There are some really good theoretical reasons to choose the square option. Squared terms are easier to deal with compared to absolute values, but more importantly, the spread of the normal distribution is parameterized via squared distances from the mean. Because the normal distribution is so important, we've chosen to define the sample variance so it matches up with the natural spread parameter of the normal distribution. So we square the deviates and then find the average deviate size (approximately) and call that the sample variance.
$$s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}$$
Why do we divide by $n-1$ instead of $n$?
1. If we divide by $n$, then on average, we would tend to underestimate the population variance $\sigma^{2}$.
2. The reason is that we are using the same set of data to estimate $\sigma^{2}$ as we did to estimate the population mean ($\mu$). If we could use
$$\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\mu\right)^{2}$$
as the estimator, we would be fine. But because we have to replace $\mu$ with $\bar{x}$ we have to pay a price.
3. Because the estimation of $\sigma^{2}$ requires first estimating one other quantity ($\bar{x}$), and given that quantity you only need $n-1$ of the data points to figure out the last one, we have used one degree of freedom on estimating the mean and we need to adjust the formula accordingly.
In later chapters we'll give this quantity a different name, so we'll introduce the necessary vocabulary here. Let $e_{i}=x_{i}-\bar{x}$ be the error left after fitting the sample mean. This is the deviation from the observed value to the “expected value” $\bar{x}$. We can then define the Sum of Squared Error as
$$SSE=\sum_{i=1}^{n}e_{i}^{2}$$
and the Mean Squared Error as $$MSE=\frac{SSE}{df}=\frac{SSE}{n-1}=s^{2}$$ where $df=n-1$ is the appropriate degrees of freedom.
To calculate the variance of our small sample of five observations $\left\{ 3,6,4,8,2\right\}$, recall that the sample mean was $\bar{x}=4.6$.
+-------+-------------------+--------------------+
| $x_i$ | $(x_i-\bar{x})$ | $(x_i-\bar{x})^2$ |
+=======+===================+====================+
| 3 | -1.6 | 2.56 |
+-------+-------------------+--------------------+
| 6 | 1.4 | 1.96 |
+-------+-------------------+--------------------+
| 4 | -0.6 | 0.36 |
+-------+-------------------+--------------------+
| 8 | 3.4 | 11.56 |
+-------+-------------------+--------------------+
| 2 | -2.6 | 6.76 |
+-------+-------------------+--------------------+
| | | SSE = 23.2 |
+-------+-------------------+--------------------+
and so the sample variance is $$s^2 = \frac{SSE}{n-1} = \frac{23.2}{(n-1)} = \frac{23.2}{4}=5.8$$
Clearly this calculation would get very tedious to do by hand and computers will be much more accurate in these calculations. In R, the sample variance is easily calculated by the function `var()`.
```{r}
ToyData <- data.frame( x=c(3,6,4,8,2) )
# var( ToyData$x )
ToyData %>% summarise( s2 = var(x) )
```
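We can also reproduce the hand calculation from the table above by computing the deviates explicitly; a sketch using dplyr:
```{r}
ToyData %>%
  mutate( e  = x - mean(x),          # the deviates
          e2 = e^2 ) %>%             # the squared deviates
  summarise( SSE = sum(e2),          # should be 23.2
             s2  = SSE / (n()-1) )   # should be 5.8
```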
For the larger TenMileRace data set, the variance is just as easily calculated.
```{r}
# var( TenMileRace$net )
TenMileRace %>% summarise( s2 = var(net) )
```
### Standard Deviation
The biggest problem with the sample variance statistic is that the units are in the original units-squared. That means if you are looking at data about car fuel efficiency, then the values would be in mpg$^{2}$ which are units that I can't really understand. The solution is to take the positive square root, which we will call the sample standard deviation. $$s=\sqrt{s^{2}}$$
But why do we take the jog through variance? Mathematically the variance is more useful and most distributions (such as the normal) are defined by the variance term. Practically though, standard deviation is easier to think about.
The sample standard deviation is important enough for R to have a function `sd()` that will calculate it for you.
```{r}
# sd( TenMileRace$net )
TenMileRace %>% summarise( s = sd(net) )
```
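As a quick sanity check, the standard deviation is just the square root of the variance we calculated earlier:
```{r}
sqrt( var( TenMileRace$net ) )   # should match sd( TenMileRace$net )
```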
### Coefficient of Variation
Suppose we had a group of animals and the sample standard deviation of the animals' lengths was 15 cm. If the animals were elephants, you would be amazed at their uniformity in size, but if they were insects, you would be astounded at the variability. To account for this, the coefficient of variation takes the sample standard deviation and divides it by the absolute value of the sample mean (to keep everything positive):
$$CV=\frac{s}{\vert\bar{x}\vert}$$
```{r}
TenMileRace %>% summarise( s = sd(net),
xbar = mean(net),
cv = s / abs(xbar) )
```
```{r}
# Previously using dplyr notation didn't help too much, but if we wanted
# to calculate the statistics separately for each sex, the dplyr solution
# is MUCH easier.
TenMileRace %>% # Summarize the Ten Mile Race Data
group_by(sex) %>% # Subsequent actions are done separately
summarise( xbar = mean(net), # for each gender
s = sd(net), #
cv = s / abs(xbar) ) # Calculate three different summary stats
```
### Empirical Rule of Thumb
For any mound-shaped sample of data the following is a reasonable rule of thumb:
| Interval | Approximate percent of Measurements |
|:----------------:|:-----------------------------------------:|
| $\bar{x}\pm s$ | 68% |
| $\bar{x}\pm 2s$ | 95% |
| $\bar{x}\pm 3s$ | 99.7% |
```{r, echo=FALSE, warning=FALSE, message=FALSE}
n <- 1000
Line.Data <- data.frame(x=seq(-4,4,length=n)) %>%
mutate(y=dnorm(x))
Text.Data1 <- data.frame(x=0, y=.25, label='bar(x) %+-% s', group="1 Std Dev") %>%
rbind(data.frame(x=0, y=.25, label='bar(x) %+-% 2*s', group="2 Std Dev")) %>%
rbind(data.frame(x=0, y=.25, label='bar(x) %+-% 3*s', group="3 Std Dev"))
Text.Data2 <- data.frame(x=0, y=.15, label='68%', group="1 Std Dev") %>%
rbind(data.frame(x=0, y=.15, label='95%', group="2 Std Dev")) %>%
rbind(data.frame(x=0, y=.15, label='99.7%', group="3 Std Dev"))
Area.Data <- Line.Data %>% filter(abs(x)<1) %>% mutate(group="1 Std Dev") %>%
rbind( Line.Data %>% filter(abs(x)<2) %>% mutate(group="2 Std Dev") ) %>%
rbind( Line.Data %>% filter(abs(x)<3) %>% mutate(group="3 Std Dev") )
ggplot( Area.Data, aes(x=x)) +
geom_line( data=Line.Data, aes(x=x, y=y)) +
geom_area(data=Area.Data, aes(x=x,y=y), fill='grey', alpha=.6) +
geom_text( data=Text.Data1, aes(x=x, y=y, label=label), parse=TRUE) +
geom_text( data=Text.Data2, aes(x=x, y=y, label=label)) +
facet_grid( group ~ .) +
labs(y='', x='') +
scale_y_continuous(breaks=NULL) +
scale_x_continuous(
breaks=c(-3,-2,-1,0,1,2,3), minor_breaks = NULL,
labels = c( expression(bar(x)-3*s ),
expression(bar(x)-2*s ),
expression(bar(x)-s ),
expression(bar(x)),
expression(bar(x)+s ),
expression(bar(x)+2*s),
expression(bar(x)+3*s)))
```
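We can check how closely the (reasonably mound-shaped) net times follow this rule of thumb:
```{r}
# Proportion of observations within 1, 2, and 3 standard deviations of the mean
TenMileRace %>%
  summarise( within.1.sd = mean( abs(net - mean(net)) <= 1*sd(net) ),
             within.2.sd = mean( abs(net - mean(net)) <= 2*sd(net) ),
             within.3.sd = mean( abs(net - mean(net)) <= 3*sd(net) ) )
```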
## Shape
We want to be able to describe the shape of a distribution and this section introduces the standard vocabulary.
### Symmetry
A distribution is said to be symmetric if there is a point along the x-axis (which we'll call $\mu$) that acts as a mirror, so that $f(\mu-x)=f(\mu+x)$ for all $x$. In the following graphs, the point of symmetry is marked with a red line.
```{r, echo=FALSE}
data <- data.frame(x=seq(-3,3, length=1000)) %>% mutate( y = dnorm(x, mean=0, sd=1))
P1 <- ggplot(data, aes(x=x, y=y)) + geom_line() + geom_vline(xintercept=0, color='red') +
labs(x=NULL, y=NULL) + scale_x_continuous(labels=NULL) + scale_y_continuous(labels=NULL)
data <- data.frame(x=seq(-5,5,length=1000)) %>%
mutate( y = (dnorm(x, mean=-2, sd=1) + dnorm(x,mean=2,sd=1))/2 )
P2 <- ggplot(data, aes(x=x, y=y)) + geom_line() + geom_vline(xintercept=0, color='red') +
labs(x=NULL, y=NULL) + scale_x_continuous(labels=NULL) + scale_y_continuous(labels=NULL)
Rmisc::multiplot(P1, P2, cols = 2)
```
A distribution that is not symmetric is said to be asymmetric.
### Unimodal or Multi-modal
Recall that one measure of centrality was the mode. If there is just a single mode, then we refer to the distribution as unimodal. If there are two or more, we refer to it as bimodal or multi-modal.
### Skew
If a distribution has a heavier tail on one side or the other, we refer to it as a *skewed* distribution and the direction of the skew is towards the heavier tail. Usually (but not always), an asymmetric distribution is skewed.
```{r, echo=FALSE}
N <- 20000
data <- data.frame(x = rexp(N, 1))
P1 <- ggplot(data, aes(x=x)) + geom_histogram(bins=50) + ggtitle('Right Skewed') +
labs(x=NULL, y=NULL) + scale_x_continuous(labels=NULL) + scale_y_continuous(labels=NULL)
data <- data.frame(x = -1 * rgamma(N, shape = 4, scale = 5))
P2 <- ggplot(data, aes(x=x)) + geom_histogram(bins=50) + ggtitle('Left Skewed') +
labs(x=NULL, y=NULL) + scale_x_continuous(labels=NULL) + scale_y_continuous(labels=NULL)
Rmisc::multiplot(P2, P1, cols = 2)
```
## Exercises
1. O&L 3.21. The ratio of DDE (related to DDT) to PCB concentrations in bird eggs has been shown to have had a number of biological implications. The ratio is used as an indication of the movement of contamination through the food chain. The paper “The ratio of DDE to PCB concentrations in Great Lakes herring gull eggs and its use in interpreting contaminants data” reports the following ratios for eggs collected at 13 study sites from the five Great Lakes. The eggs were collected from both terrestrial and aquatic feeding birds.
| Source Type | DDE to PCB Ratio |
|:-----------------:|:----------------------------------------------------------------------|
| **Terrestrial** | 76.50, 6.03, 3.51, 9.96, 4.24, 7.74, 9.54, 41.70, 1.84, 2.5, 1.54 |
| **Aquatic** | 0.27, 0.61, 0.54, 0.14, 0.63, 0.23, 0.56, 0.48, 0.16, 0.18 |
a) By hand, compute the mean and median separately for each type of feeder.
b) Using your results from part (a), comment on the relative sensitivity of the mean and median to extreme values in a data set.
c) Which measure, mean or median, would you recommend as the most appropriate measure of the DDE to PCB level for both types of feeders? Explain your answer.
2. O&L 3.31. Consumer Reports in its June 1998 issue reports on the typical daily room rate at six luxury and nine budget hotels. The room rates are given in the following table.
| Hotel Type | Nightly Rate |
|:-------------:|:-----------------------------------------------------|
| Luxury | 175, 180, 120, 150, 120, 125 |
| Budget | 50, 50, 49, 45, 36, 45, 50, 50, 40 |
a) By hand, compute the means and standard deviations of the room rates for each class of hotel.
b) Give a practical reason why luxury hotels might have higher variability than the budget hotels. (Don't just say the standard deviation is higher because there is more spread in the data, but rather think about the Hotel Industry and why you might see greater price variability for upscale goods compared to budget items.)
3. Suppose that we have two groups of individuals, each with a mean value $\bar{x}_i$. We wish to combine the two groups together and find the mean value for the combined group (denoted $\bar{x}_c$).
    a) Suppose $n_1=1$ and $n_2=1$ (i.e. there is one individual in each group) and the means of the two groups are $\bar{x}_1 = 10$ and $\bar{x}_2=14$. What is the sum of the ages of individuals in the combined group? What is the mean age of the combined group?
    b) Suppose instead that $n_1=50$. What is the mean age of the combined group?
c) Finally suppose that $n_2=75$. What is the mean age of the combined group?
d) Write a formula for the weighted average of the mean of the combined group as $$\bar{x}_c = w_1 \bar{x}_1 + w_2 \bar{x}_2 = \sum_{i=1}^2 w_i \bar{x}_i$$ by defining the group *weighting factors* $w_i$.
e) Explain why the weights represent the relative contribution to the overall mean for each group.
4. Use R to confirm your calculations in problem 1 (the pollution data). Show the code you used and the subsequent output. It will often be convenient for me to give you code that generates a data frame instead of uploading an Excel file and having you read it in. The data can be generated using the following commands:
```{r}
PolutionRatios <- data.frame(
Ratio = c(76.50, 6.03, 3.51, 9.96, 4.24, 7.74, 9.54, 41.70, 1.84, 2.5, 1.54,
0.27, 0.61, 0.54, 0.14, 0.63, 0.23, 0.56, 0.48, 0.16, 0.18 ),
Type = c( rep('Terrestrial',11), rep('Aquatic',10) ) )
# Print out some of the data to confirm what the column names are
head( PolutionRatios )
```
*Hint: for computing the means and medians for each type of feeder separately, the `group_by()` command we demonstrated earlier in the chapter is convenient.*
5. Use R to confirm your calculations in problem 2 (the hotel data). Show the code you used and the subsequent output. The data can be loaded into a data frame using the following commands:
```{r}
Hotels <- data.frame(
Price = c(175, 180, 120, 150, 120, 125, 50, 50, 49, 45, 36, 45, 50, 50, 40),
Type = c( rep('Luxury',6), rep('Budget', 9) ) )
# Print out some of the data to confirm what the column names are
head( Hotels )
```
6. For the hotel data, create side-by-side box-and-whisker plots to compare the prices.
7. Match the following histograms to the appropriate boxplot.
```{r, echo=FALSE}
n <- 20000
m1 <- rnorm(n, mean=20, sd=2.5)
m2 <- rnorm(n, mean=20, sd=10)
m3 <- rgamma(n, shape=4, scale=15/4)
m4 <- c(rnorm(n/2, mean=0, sd=6), rnorm(n/2, mean=30, sd=6))
d1 <- data.frame(obs=c(m1,m2,m3,m4), Group=rep(c('A','B','C','D'), each=n))
d2 <- data.frame(obs=c(m2,m1,m4,m3), Group=rep(c('1','2','3','4'), each=n))
library(ggplot2)
p <- ggplot(d1, aes(obs)) +
geom_histogram(aes(y=..density..), binwidth=1) +
facet_wrap(~Group, scales='free') + xlab('Observed Value')
print(p)
```
```{r, echo=FALSE}
p <- qplot(Group, obs, data=d2, geom='boxplot') +
xlab('') + ylab('Observed Value')
print(p)
```
a) Histogram A goes with boxplot __________
b) Histogram B goes with boxplot __________
c) Histogram C goes with boxplot __________
d) Histogram D goes with boxplot __________
8. Twenty-five employees of a corporation have a mean salary of $62,000 and the sample standard deviation of those salaries is $15,000. If each employee receives a bonus of $1,000, does the standard deviation of the salaries change? Explain your reasoning.
9. Suppose that we are examining the histograms of the salaries of two corporations. They have the same mean salary, but very different shapes. Describe distributions that would result in different median salaries.
10. The chemicals in clay used to make pottery can differ depending on the geographical region where the clay originated. Sometimes, archaeologists use a chemical analysis of clay to help identify where a piece of pottery originated. Such an analysis measures the amount of a chemical in the clay as a percent of the total weight of the piece of pottery. The boxplots below summarize analyses done for three chemicals—X, Y, and Z—on pieces of pottery that originated at one of three sites: I, II, or III.
```{r, echo=FALSE}
ni <- 100
data <- rbind(
data.frame( Site='I', Chemical='X', Y=runif(ni, 6, 8) ),
data.frame( Site='I', Chemical='Y', Y=runif(ni, 11, 15) ),
data.frame( Site='I', Chemical='Z', Y=runif(ni, 4, 10) ),
data.frame( Site='II', Chemical='X', Y=runif(ni, 5, 7) ),
data.frame( Site='II', Chemical='Y', Y=runif(ni, 1.8, 4) ),
data.frame( Site='II', Chemical='Z', Y=runif(ni, 6, 8) ),
data.frame( Site='III', Chemical='X', Y=runif(ni, 5, 7.5) ),
data.frame( Site='III', Chemical='Y', Y=runif(ni, 6, 8) ),
data.frame( Site='III', Chemical='Z', Y=runif(ni, 3, 11) )
)
ggplot(data, aes(x=Chemical, y=Y)) +
geom_boxplot() +
facet_grid(.~Site, labeller=label_both) +
ylab('Percent of Total Weight') +
scale_y_continuous(breaks=seq(0,16,by=2))
```
a) For chemical Z, describe how the percents found in the pieces of pottery are similar and how they differ among the three sites.
b) Consider a piece of pottery known to have originated at one of the three sites, but the actual site is not known.
i) Suppose an analysis of the clay reveals that the sum of the percents of the three chemicals X, Y, and Z is $20.5\%$. Based on the boxplots, which site—I, II, or III—is the most likely site where the piece of pottery originated? Justify your choice.
ii) Suppose only one chemical could be analyzed in the piece of pottery. Which chemical—X, Y, or Z— would be the most useful in identifying the site where the piece of pottery originated? Justify your choice.