---
title: "Organizing Data in R"
author: "Douglas Bates"
date: "2014-09-12"
output: ioslides_presentation
---
# Data frames, examining structure
## Data frames
- Standard rectangular data sets (columns are variables, rows are observations) are stored in `R` as _data frames_.
- The columns can be _numeric_ variables (e.g. measurements or counts) or _factor_ variables (categorical data) or _ordered_ factor variables. These types are called the _class_ of the variable.
- Many `R` packages contain sample data sets used to illustrate the techniques implemented in the package.
- There is also a `datasets` package containing datasets used in the example sections of the base `R` documentation.
    * many of these datasets are old and small, dating from the early days of `R`
```{r datasets}
ls("package:datasets")
```
## The `str` and `summary` functions
- The `str` function provides a concise description of the structure of a `data.frame` (or any other class of object in `R`). The `summary` function summarizes each variable according to its class. Both are highly recommended for routine use.
```{r strFormaldehyde}
str(Formaldehyde)
summary(Formaldehyde)
```
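
As a small sketch of "according to its class", the built-in `InsectSprays` data frame has one numeric and one factor column, so `summary` reports each differently. The last line is only an illustration that `str` also accepts objects that are not data frames.

```{r summaryclasses}
# InsectSprays has one numeric column and one factor column
sapply(InsectSprays, class)
# Quantiles and mean for 'count', counts per level for 'spray'
summary(InsectSprays)
# str also works on objects that are not data frames
str(letters)
```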
## `head` and `tail`
- Entering just the name of the data frame causes it to be printed. For large data frames use the `head` and `tail` functions to view the first few or last few rows.
```{r headswiss}
head(OrchardSprays)
str(OrchardSprays)
```
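
The chunk above shows `head` only. As a brief sketch, `tail` works the same way, and both functions accept an `n` argument giving the number of rows to show.

```{r tailn}
# Last six rows (the default) of the data frame
tail(OrchardSprays)
# Both functions take an 'n' argument; here, only the first three rows
head(OrchardSprays, n = 3)
```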
## `ls.str`
- The operations of listing the objects in a package and providing a brief description of their structure are combined in `ls.str`.
```{r lsstr}
ls.str("package:datasets")
```
# Input and saving data objects
## Data input
- The simplest way to input a rectangular data set is to save it as a comma-separated value (`csv`) file and read it with `read.csv`.
- The first argument is the name of the file. On Windows it can be tricky to get the file path correct. The `file.choose` function will bring up a chooser panel.
- `read.csv` just calls `read.table` with a different set of default arguments.
- The first argument to `read.csv`, `read.table`, etc. can be a __connection__ or a __URL__ instead of a file name.
- Connection types (see `?connection`):
    * `gzfile` - a file compressed with `gzip`
    * `bzfile` - a file compressed with `bzip2`
    * `xzfile` - a file compressed with `xz`
    * `unz` - a single file from a zip archive
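
A minimal sketch of the statement that `read.csv` is `read.table` with different defaults. The file contents and the names `tmp`, `d1` and `d2` are made up for the illustration.

```{r readtable}
# Write a tiny, made-up csv file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "a,1.5", "b,2.5"), tmp)
# read.csv with its defaults ...
d1 <- read.csv(tmp)
# ... and read.table with the csv defaults spelled out
d2 <- read.table(tmp, header = TRUE, sep = ",", quote = "\"",
                 dec = ".", fill = TRUE, comment.char = "")
identical(d1, d2)
unlink(tmp)
```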
## Reading a compressed file or URL
```{r sd1,warning=FALSE}
str(sd1 <- read.csv(gzfile("./sd1.csv.gz","r")))
```
```{r classroom}
str(classroom <- read.csv("http://www-personal.umich.edu/~bwest/classroom.csv"))
```
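
The files read above may not be available on every machine. The following self-contained sketch writes a built-in data frame to a gzip-compressed csv in a temporary file and reads it back through a `gzfile` connection. The names `gzpath`, `con` and `frm` are chosen only for this example.

```{r gzroundtrip}
# Write a built-in data frame to a gzip-compressed csv file
gzpath <- tempfile(fileext = ".csv.gz")
con <- gzfile(gzpath, "w")
write.csv(Formaldehyde, con, row.names = FALSE)
close(con)
# Read it back; read.csv opens and closes the unopened connection itself
str(frm <- read.csv(gzfile(gzpath)))
unlink(gzpath)
```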
## Copying, saving and restoring data objects
- Assigning a data object to a new name creates a copy.
- You can save a data object to a file, typically with the extension `.rda`, using the `save` function.
- To restore the data you `load` the file.
```{r saveload}
sprays <- InsectSprays
save(sprays,file="sprays.rda")
rm(sprays)
ls()
load("sprays.rda")
names(sprays)
```
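
A short sketch of what "creates a copy" means in practice: changing the copy leaves the original untouched. The name `spr2` is hypothetical and used only here.

```{r copysemantics}
# 'spr2' is a made-up name for this illustration
spr2 <- InsectSprays
spr2$count <- spr2$count + 1   # modify the copy only
head(InsectSprays$count)       # the original values are unchanged
head(spr2$count)               # the copy holds the modified values
```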
## Compression when saving
- By default `save` compresses the file with `gzip`, whatever its extension (`.rda` and `.RData` are the usual choices).
- Using `compress="xz"` usually provides a greater compression ratio at the expense of more compute time.
- The choice matters little for small data sets but can matter for large ones.
```{r saveclassroom}
save(classroom,file="classroom.rda") # file size is 14.4 KB
save(classroom,file="classroom1.rda",compress="xz") # 9.1 KB
```
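
The sizes quoted in the comments above can be checked directly. This sketch uses temporary files and the small built-in `InsectSprays` data, so the two sizes may be close or even reversed; the names `f_gz`, `f_xz` and `sizes` are assumptions made for the example.

```{r compresssizes}
# Temporary files, so nothing is left in the working directory
f_gz <- tempfile(fileext = ".rda")
f_xz <- tempfile(fileext = ".rda")
save(InsectSprays, file = f_gz)                    # default gzip compression
save(InsectSprays, file = f_xz, compress = "xz")   # xz compression
# File sizes in bytes, in the order gzip, xz
sizes <- file.info(c(f_gz, f_xz))$size
sizes
unlink(c(f_gz, f_xz))
```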
# Accessing and modifying variables
## Accessing and modifying variables
- The `$` operator is used to access variables within a data frame.
```{r dollarop}
str(Formaldehyde$carb)
```
- You can also use `$` to assign to a variable name
```{r dollaropleft}
sprays$sqrtcount <- sqrt(sprays$count)
names(sprays)
```
## Removing variables
- Assigning the special value `NULL` to the name of a
variable removes it.
```{r dollaropleftNULL}
sprays$sqrtcount <- NULL
names(sprays)
```
## Using `with`
- In complex expressions it can become tedious to repeatedly
type the name of the data frame.
- The `with` function allows for direct access to variable
names within an expression. It provides "read-only" access.
```{r formalfoo}
Formaldehyde$carb * Formaldehyde$optden
with(Formaldehyde, carb * optden)
```
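
A sketch of what "read-only" means here: an assignment made inside `with` does not add a variable to the data frame. The variable name `ratio` is made up for the illustration.

```{r withreadonly}
# The assignment happens in a temporary environment and is then discarded
with(Formaldehyde, ratio <- optden / carb)
# The data frame still has only its two original columns
names(Formaldehyde)
```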
## Using `within`
- The `within` function provides read-write access to a data
frame. It does not change the original frame; it returns a modified
copy. To change the stored object you must assign the result
to the name.
```{r within}
sprays <- within(sprays, sqrtcount <- sqrt(count))
str(sprays)
```
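
To see that `within` returns a modified copy, this sketch stores the result under a different name and compares the column names. The names `modified` and `logcount` are hypothetical.

```{r withincopy}
# 'logcount' is a made-up variable name for this illustration
modified <- within(InsectSprays, logcount <- log1p(count))
names(InsectSprays)   # the original frame is unchanged
names(modified)       # the returned copy has the new column
```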
# Data Organization
## Data Organization
- Careful consideration of the data layout for experimental or
observational data is repaid in later ease of analysis. Sadly, the
widespread use of spreadsheets does not encourage such careful
consideration.
- If you are organizing data in a table, use consistent data
types within columns. Databases require this; spreadsheets don't.
- A common practice in some disciplines is to convert
categorical data to 0/1 "indicator variables" or to code the levels
as numbers with a separate "data key". This practice is
unnecessary and error-inducing in `R`. When you see categorical
variables coded as numeric variables, change them to `factor`s or
`ordered` factors.
- Spreadsheets also encourage the use of a "wide" data format,
especially for longitudinal data. Each row corresponds to an
experimental unit and multiple observation occasions are
represented in different columns. The "long" format is
preferred in `R`.
## Converting numeric variables to factors
- The `factor` (`ordered`) function creates a factor
(ordered factor) from a vector. Factor labels can be specified in
the optional `labels` argument.
- Suppose the `spray` variable in the `InsectSprays`
data was stored as numeric values $1, 2,\dots,6$. We convert it
back to a factor with `factor`.
```{r sprays}
str(sprays <- within(InsectSprays, spray <- as.integer(spray)))
str(sprays <- within(sprays, spray <- factor(spray, labels = LETTERS[1:6])))
```
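
The chunk above uses `factor`. As a companion sketch, `ordered` works the same way for categories that have a natural order. The `severity` codes and their labels are invented for the example.

```{r orderedfactor}
# Hypothetical severity codes stored as the numbers 1 to 3
severity <- c(2, 1, 3, 3, 1)
sev <- ordered(severity, levels = 1:3, labels = c("low", "medium", "high"))
str(sev)
# Comparisons respect the ordering of the levels
sev < "high"
```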
# Subsets of data frames
## Subsets of data frames
- The `subset` function is used to extract a subset of the
rows or of the columns or of both from a data frame.
- The first argument is the name of the data frame. The
second is an expression indicating which rows are to be selected.
- This expression often uses comparison operators such as
`==` (equality), `!=` (inequality) or `>=` ("greater than
or equal to").
```{r sprayA}
str(sprayA <- subset(sprays, spray == "A"))
```
- The optional argument `select` can be used to specify the
variables to be included.
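
A brief sketch of the `select` argument, using the built-in `InsectSprays` data so that it runs on its own. The name `countA` is chosen only for this example.

```{r subsetselect}
# Rows for spray "A", keeping only the 'count' column
str(countA <- subset(InsectSprays, spray == "A", select = count))
# A negative selection drops a column instead of keeping it
str(subset(InsectSprays, count > 20, select = -spray))
```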
## Subsets and factors
- The way that factors are defined, a subset of a factor retains
the original set of levels. Usually this is harmless but
sometimes it can cause unexpected results.
- You can "drop unused levels" by applying `factor` to
the factor. Many functions, such as `xtabs`, which is used to
create cross-tabulations, have optional arguments with names like
`drop.unused.levels` to automate this.
```{r xtabssprays}
xtabs( ~ spray, sprayA)
xtabs( ~ spray, sprayA, drop.unused.levels = TRUE)
```
## Dropping unused levels in the spray factor
```{r spraysdrop}
str(sprayA <- within(sprayA, spray <- factor(spray)))
xtabs( ~ spray, sprayA)
```
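
An alternative to re-applying `factor`, not used in these slides, is the base `R` function `droplevels`, which drops unused levels from every factor in a data frame at once. The name `sprA` is made up for this sketch.

```{r droplevelsfun}
# Subset to spray "A", then drop the five unused levels in one step
sprA <- droplevels(subset(InsectSprays, spray == "A"))
levels(sprA$spray)
```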
## The `%in%` operator
- Another useful comparison operator is `%in%` for
selecting a subset of the values in a variable.
```{r sprayDEF}
str(sprayDEF <- subset(sprays, spray %in% c("D","E","F")))
```
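
As a sketch of what `%in%` saves, the same selection can be written as a chain of `==` comparisons joined by `|`. The names `sel_in` and `sel_or` are used only here.

```{r inequiv}
# Selection with %in% ...
sel_in <- subset(InsectSprays, spray %in% c("D", "E", "F"))
# ... and the longer equivalent with == and |
sel_or <- subset(InsectSprays, spray == "D" | spray == "E" | spray == "F")
identical(sel_in, sel_or)
```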
## "Long" and "wide" forms of data
- Spreadsheet users tend to store balanced data, such as `InsectSprays`, across many columns. This is called the "wide" format. The `unstack` function converts a simple "long" data set to wide; `stack` converts in the other direction.
```{r unstack}
str(unstack(InsectSprays))
```
- The problem with the wide format is that it only works for balanced data. A designed experiment may produce balanced data (although "Murphy's Law" would indicate otherwise) but observational data are rarely balanced. Use the long format when possible.
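
A round-trip sketch: `unstack` to wide, then `stack` back to long. Note that `stack` names its columns `values` and `ind` rather than restoring the original names. The names `wide` and `long` are chosen for the example.

```{r stackroundtrip}
wide <- unstack(InsectSprays)   # one column per spray
long <- stack(wide)             # columns are named 'values' and 'ind'
str(long)
# The counts come back in the original order
all(long$values == InsectSprays$count)
```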
## Using reshape
- The `reshape` function allows for more general translations of long to wide and vice-versa. It is specifically intended for longitudinal data.
- There is also a package called `"reshape"` with even more general (but potentially confusing) capabilities.
- Phil Spector's book, __Data Manipulation with R__ (Springer, 2008) covers this topic in more detail.
```{r classroomfactor,echo=FALSE,results='hide'}
classroom <- within(classroom,schoolid <- factor(schoolid))
```
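
A minimal sketch of `reshape` converting wide longitudinal data to long. The data frame `wide_df`, its values, and the column name `occasion` are invented for the illustration.

```{r reshapelong}
# A made-up wide data frame: two subjects measured on three occasions
wide_df <- data.frame(id = 1:2,
                      y.1 = c(5.1, 4.8),
                      y.2 = c(5.6, 5.0),
                      y.3 = c(6.0, 5.3))
# Convert to long format: one row per subject-occasion combination
long_df <- reshape(wide_df, direction = "long",
                   varying = c("y.1", "y.2", "y.3"),
                   idvar = "id", timevar = "occasion")
long_df
```

With names of the form `y.1`, `y.2`, ... the default `sep = "."` lets `reshape` work out the variable name and the occasion numbers by itself.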
## Determining unique rows in a data frame
- One disadvantage of keeping data in the long format is
redundancy and the possibility of inconsistency.
- In the first set of exercises you are asked to create a data
frame `classroom` from a csv file available on the Internet.
Each of the `r nrow(classroom)` rows corresponds to a student
in a classroom in a school. There is one numeric "school level"
covariate, `housepov`.
- To check if `housepov` is stored consistently we select
the unique combinations of just the two columns `schoolid` and `housepov`.
```{r clasuniq}
str(unique(subset(classroom, select = c(schoolid,housepov))))
```
Because there are 107 unique combinations and 107 schools,
`housepov` is consistent with `schoolid`.
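
The same check can be sketched on a small made-up data frame in which one school is deliberately inconsistent. The names `toy`, `school`, `pov` and `combos` are hypothetical.

```{r uniquecheck}
# Made-up long-format data: 'pov' should be constant within each school
toy <- data.frame(school = factor(c("s1", "s1", "s2", "s2", "s3")),
                  pov    = c(0.10, 0.10, 0.25, 0.30, 0.05))
combos <- unique(subset(toy, select = c(school, pov)))
nrow(combos)          # number of distinct (school, pov) pairs
nlevels(toy$school)   # number of schools
# Schools that appear with more than one value are inconsistent
names(which(table(combos$school) > 1))
```

When the number of distinct pairs exceeds the number of schools, as it does here, at least one school-level value is stored inconsistently.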