Data.Rmd

---
title: "Organizing Data in R"
author: "Douglas Bates"
date: "2014-09-12"
output: ioslides_presentation
---
# Data frames, examining structure

## Data frames
- Standard rectangular data sets (columns are variables, rows are observations) are stored in `R` as _data frames_.
- The columns can be _numeric_ variables (e.g. measurements or counts) or _factor_ variables (categorical data) or _ordered_ factor variables.  These types are called the _class_ of the variable.
- Many `R` packages contain sample data sets used to illustrate the techniques implemented in the package.
- There is also a `datasets` package containing datasets used in the example sections of the base `R` documentation.
    * many of these datasets are old and small, dating from the early days of `R`
```{r datasets}
ls("package:datasets")
```

## The `str` and `summary` functions
- The `str` function provides a concise description of the structure of a `data.frame` (or any other class of object in `R`).  The `summary` function summarizes each variable according to its class.  Both are highly recommended for routine use.
```{r strFormaldehyde}
str(Formaldehyde)
summary(Formaldehyde)
```

## `head` and `tail`
- Entering just the name of the data frame causes it to be printed.  For large data frames use the `head` and `tail` functions to view the first few or last few rows.
```{r headswiss}
head(OrchardSprays)
str(OrchardSprays)
```

## `ls.str`

- The operations of listing the objects in a package and providing a brief description of their structure are combined in `ls.str`
```{r lsstr}
ls.str("package:datasets")
```

# Input and saving data objects

## Data input
- The simplest way to input a rectangular data set is to save it as a comma-separated value (`csv`) file and read it with `read.csv`.
- The first argument is the name of the file.  On Windows it can be tricky to get the file path correct.  The `file.choose` function will bring up a chooser panel.
- `read.csv` just calls `read.table` with a different set of default arguments
- The first argument to `read.csv`, `read.table`, etc. can be a __connection__ or a __URL__ instead of a file name.
- Connection types (see `?connection`)
    * `gzfile` - a file compressed with `gzip`
    * `bzfile` - a file compressed with `bzip2`
    * `xzfile` - a file compressed with `xz`
    * `unz` - a single file from a zip archive

## Reading a compressed file or URL
```{r sd1,warning=FALSE}
str(sd1 <- read.csv(gzfile("./sd1.csv.gz","r")))
```
```{r classroom}
str(classroom <- read.csv("http://www-personal.umich.edu/~bwest/classroom.csv"))
```

## Copying, saving and restoring data objects
- Assigning a data object to a new name creates a copy.
- You can save a data object to a file, typically with the extension `.rda`, using the `save` function.
- To restore the data you `load` the file 
```{r saveload}
sprays <- InsectSprays
save(sprays,file="sprays.rda")
rm(sprays)
ls()
load("sprays.rda")
names(sprays)
```

## Compression when saving
- By default, when saving to a file with extension `.rda` or `.RData`, the file is compressed with `gzip`.
- Using `compress="xz"` provides a greater compression ratio at the expense of more compute time
- For small data sets it is not important.  For large data it can be.
```{r saveclassroom}
save(classroom,file="classroom.rda")   # file size is 14.4 KB
save(classroom,file="classroom1.rda",compress="xz") # 9.1 KB
```

# Accessing and modifying variables

## Accessing and modifying variables
  
- The `$` operator is used to access variables within a data frame.
```{r dollarop}
str(Formaldehyde$carb)
```     
- You can also use `$` to assign to a variable name
```{r dollaropleft}
sprays$sqrtcount <- sqrt(sprays$count)
names(sprays)
```     

## Removing variables

- Assigning the special value `NULL` to the name of a
  variable removes it.
```{r dollaropleftNULL}
sprays$sqrtcount <- NULL
names(sprays)
```     


## Using `with`

- In complex expressions it can become tedious to repeatedly
  type the name of the data frame.
- The `with` function allows for direct access to variable
  names within an expression.  It provides "read-only" access.
```{r formalfoo}
Formaldehyde$carb * Formaldehyde$optden
with(Formaldehyde, carb * optden)
``` 

## Using `within`
- The `within` function provides read-write access to a data
  frame.  It does not change the original frame; it returns a modified
  copy.  To change the stored object you must assign the result
  to the name.
```{r within}
sprays <- within(sprays, sqrtcount <- sqrt(count))
str(sprays)
``` 

# Data Organization
  
## Data Organization
- Careful consideration of the data layout for experimental or
  observational data is repaid in later ease of analysis.  Sadly, the
  widespread use of spreadsheets does not encourage such careful
  consideration.
- If you are organizing data in a table, use consistent data
  types within columns.  Databases require this; spreadsheets don't.
- A common practice in some disciplines is to convert
  categorical data to 0/1 "indicator variables or to code the levels
  as numbers with a separate "data key".  This practice is
  unnecessary and error-inducing in `R`.  When you see categorical
  variables coded as numeric variables, change them to `factor`s or
  `ordered` factors.
- Spreadsheets also encourage the use of a "wide" data format,
  especially for longitudinal data.  Each row corresponds to an
  experimental unit and multiple observation occasions are
  represented in different columns.  The "long" format is
  preferred in `R`.
  

## Converting numeric variables to factors
  
- The `factor` (`ordered`) function creates a factor
  (ordered factor) from a vector.  Factor labels can be specified in
  the optional `labels` argument.
- Suppose the `spray` variable in the `InsectSprays`
  data was stored as numeric values $1, 2,\dots,6$.  We convert it
  back to a factor with `factor`.
  
```{r sprays}
str(sprays <- within(InsectSprays, spray <- as.integer(spray)))
str(sprays <- within(sprays, spray <- factor(spray, labels = LETTERS[1:6])))
``` 


# Subsets of data frames


## Subsets of data frames
  
- The `subset` function is used to extract a subset of the
  rows or of the columns or of both from a data frame.
- The first argument is the name of the data frame. The
  second is an expression indicating which rows are to be selected.
- This expression often uses logical operators such as
  `==`, the equality comparison, or `!=`, the inequality
  comparison, `>=`, meaning "greater than or equal to", etc.
```{r sprayA}
str(sprayA <- subset(sprays, spray == "A"))
```   
\item The optional argument `select` can be used to specify the
  variables to be included.

## Subsets and factors
  
- The way that factors are defined, a subset of a factor retains
  the original set of levels.  Usually this is harmless but
  sometimes it can cause unexpected results.
- You can "drop unused levels" by applying `factor` to
  the factor.  Many functions, such as `xtabs`, which is used to
  create cross-tabulations, have optional arguments with names like
  `drop.unused.levels` to automate this.
  
```{r xtabssprays}
xtabs( ~ spray, sprayA)
xtabs( ~ spray, sprayA, drop = TRUE)
``` 


## Dropping unused levels in the spray factor
```{r spraysdrop}
str(sprayA <- within(sprayA, spray <- factor(spray)))
xtabs( ~ spray, sprayA)
```   

## The `%in%` operator
\item Another useful comparison operator is `%in%` for
  selecting a subset of the values in a variable.

```{r sprayDEF}
str(sprayDEF <- subset(sprays, spray %in% c("D","E","F")))
``` 

## "Long" and "wide" forms of data
  
- Spreadsheet users tend to store balanced data, such as `InsectSprays`, across many columns.  This is called the "wide" format.  The `unstack` function converts a simple "long" data set to wide; `stack` for the other way.
```{r unstack}
str(unstack(InsectSprays))
```     
- The problem with the wide format is that it only works for balanced data.  A designed experiment may produce balanced data (although "Murphy's Law" would indicate otherwise) but observational data are rarely balanced. Use the long format when possible.

## Using reshape
  
- The `reshape` function allows for more general translations of long to wide and vice-versa.  It is specifically intended for longitudinal data.
- There is also a package called `"reshape"` with even more general (but potentially confusing) capabilities.
- Phil Spector's book, __Data Manipulation with R__ (Springer, 2008) covers this topic in more detail.
```{r classroomfactor,echo=FALSE,results='hide'}
classroom <- within(classroom,schoolid <- factor(schoolid))
```


## Determining unique rows in a data frame
  
- One disadvantage of keeping data in the long format is
  redundancy and the possibility of inconsistency.
- In the first set of exercises you are asked to create a data
  frame `classroom` from a csv file available on the Internet.
  Each of the `r nrow(classroom)` rows corresponds to a student
  in a classroom in a school.  There is one numeric "school level"
  covariate, `housepov`.
- To check if `housepov` is stored consistently we select
  the unique combinations of only those two columns
  
```{r clasuniq}
str(unique(subset(classroom, select = c(schoolid,housepov))))
``` 
Because there are 107 unique combinations and 107 schools,
`housepov` is consistent with `schoolid`.