03--parsing.Rmd

---
output: html_document
bibliography: "bibtexlib.bib"
---

```{r setup, include=FALSE}
source("style.R")
```

# Getting data into R

## Typical format for microbiome data

Most `r gloss$add('pipelines')` for processing high-throughput `r gloss$add('amplicon')` data, such as mothur, QIIME, and dada2, result in a matrix of read counts.
One dimension of this matrix (i.e. the rows or the columns) consists of `r gloss$add('Operational Taxonomic Units (OTUs)')`, `r gloss$add('phylotypes')`, or `r gloss$add('exact sequence variants (ESVs)')` (all ways to "bin" similar read sequences).
The other dimension consists of samples.
Different tools will expect/output different orientations of the matrix, but, in our case, columns are samples and rows are OTUs. Sometimes the `r gloss$add('Operational Taxonomic Units (OTUs)', show = 'OTU')` data and the abundance matrix are two separate tables.
There is usually another table with sample information in rows.
This makes it easy to add lots of additional sample data columns that can be used to subset the data. 
Each sample and OTU will have an unique ID.

```{r echo=FALSE}
knitr::include_graphics('./figure_sources/typical_input_format.png')
```


## Importing data into R

Importing data into R can be quite easy if the data is formatted well, but can be a very frustrating experience otherwise.
An example of well-formatted data is `r gloss$add('Comma-delimited text file', shown = '.csv (comma-separated value)')` or `r gloss$add('Tab-delimited text file', shown = '.tsv (tab-separated value)')` files, each with a single table and no additional comments or formatting (e.g. merged cells).
Either of these formats might also have a .txt extension (the extension does not really matter; its for humans, not computers).
For more information on correct data formatting, see the [data formating section](http://grunwaldlab.github.io/Reproducible-science-in-R/03--Data_formatting.html) of our [guide for reporducible research](http://grunwaldlab.github.io/Reproducible-science-in-R/index.html).
You should always import the raw output data whenever possible and avoid any "manual" (i.e. non-scripted) modification of the data, especially in programs like Excel, which are known to mangle data from time to time (@zeeberg2004mistaken).

Throughout this workshop, we will be using data from @wagner2016host, a study on the effects of plant age, genotype, and environment on the bacterial microbiome of [*Boechera stricta*](https://en.wikipedia.org/wiki/Boechera_stricta), a perennial herb in the mustard family.
Here is a photo of *Boechera stricta* taken by Mary Ellen Harte:

```{r echo=FALSE}
knitr::include_graphics('images/boechera_stricta.jpg')
```

@wagner2016host released their raw data with the article and it is available [here](http://datadryad.org/resource/doi:10.5061/dryad.g60r3) on [dryad](http://datadryad.org/).
This is a great example of how to share your raw data!

There are many `r gloss$add('function', shown = 'functions')` commonly used to read tabular data, including the `r gloss$add('base R')` ones like `read.table` and `read.csv`, but we will be using functions from the new [`readr` package](http://readr.tidyverse.org/articles/readr.html), which returns `r gloss$add('tibble', shown = 'tibbles')` instead of `data.frame`s (A "table" in R).

```{r}
library(readr) # Loads the readr package so we can use `read_tsv`
```

Tibbles are a type of `data.frame` with some fancier printing and more consistent behavior.
Click <a href="data/otuTable97.txt.bz2" download="data/otuTable97.txt.bz2">here</a> to download the OTU table.
Lets read in the raw OTU table first:

```{r message=FALSE, cache=TRUE}
otu_data <- read_tsv("data/otuTable97.txt.bz2") # You might need to change the path to the file
print(otu_data) # You can also enter just `otu_data` to print it
```

This is a big data set, with `r format(nrow(otu_data), big.mark = ",")` rows (OTUs) and `r format(ncol(otu_data), big.mark = ",")` columns (`r format(ncol(otu_data) - 1, big.mark = ",")` samples and an OTU ID).
If your computer cannot load this file, don't worry, we will provide a subset later for the rest of the workshop. 

In this data set, the taxonomic classifications for the OTUs are in a different file.
This information could have been included as additional columns in the OTU table and often is in other data sets.
Click <a href="data/taxAssignments97.txt" download="data/taxAssignments97.txt">here</a> to download the taxonomic classifications table.

```{r message=FALSE, cache=TRUE}
tax_data <- read_tsv("data/taxAssignments97.txt")
print(tax_data) # You can also enter `tax_data` to print it
```

Although these data are very well-formatted compared to most, there are still a few issues.
The "OTU ID" column contains a space in the name (hence the back ticks), which makes it a bit more annoying to work with in R.
More importantly, the OTU IDs in the taxonomy table are prefixed with "OTU_" and those in the OTU table are not, so we have to remove that prefix to make the two match up.
The functions `sub` and `gsub` are used to search and replace parts of text; `sub` replaces only the first match and `gsub` replaces all matches.
Replacing with nothing (`""`) effectively searches and deletes.

```{r}
tax_data$`OTU ID` <- sub(tax_data$`OTU ID`, # ` are needed because of the space
                         pattern = "OTU_", replacement = "")
print(tax_data) 
```

Although we could proceed with the analysis using separate OTU and taxonomy tables, lets combine them to simplify things.
Since the rows are in different order, we need to combine (aka "join") them based on their OTU ID.
We will use the [`dplyr` package](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) for this.

```{r message=FALSE}
library(dplyr) # Loads the dplyr package so we can use `left_join`
tax_data$`OTU ID` <- as.character(tax_data$`OTU ID`) # Must be same type for join to work
otu_data$OTU_ID <- as.character(otu_data$OTU_ID) # Must be same type for join to work
otu_data <- left_join(otu_data, tax_data,
                      by = c("OTU_ID" = "OTU ID")) # identifies cols with shared IDs
print(otu_data)
```

There are so many columns that all of them are not shown in the print out, but we can verify that they are there by looking at the last 10 column names: 

```{r}
tail(colnames(otu_data), n = 10) # `tail` returns the last n elements
```

Next, lets load the sample data.
Click <a href="data/SMD.txt" download="data/SMD.txt">here</a> to download the sample data table.

```{r message=FALSE}
sample_data <- read_tsv("data/SMD.txt",
                        col_types = "cccccccccccccccc") # each "c" means a column of "character"
print(sample_data) # You can also enter `sample_data` to print it
```

Note how the number of sample columns in `otu_data` is equal to the number of rows in `sample_data` and the columns names of `otu_data` appear in the "SampleID" column.
This means that the contents of `sample_data$SampleID` can be used to subset columns in the OTU table.

## Converting to the `taxmap` format

Although our data is now in R, it is not in a format that is specialized for community abundance data; all R knows is that you have a few big tables. 
Different R packages for community (e.g. microbiome) analysis expect data in different formats or `r gloss$add('class', shown = 'classes')`.
A class, in programming jargon, is a defined way to store data plus some functions designed to interact with that data.
When you format a specific data set in this way, we call it an `r gloss$add('object')` or an "instance" of the class.
Many R packages implement their own classes and functions to convert data to their format, whereas some packages use the classes defined in other packages.
There are a few options for how to store an abundance matrix classified by a taxonomy in R (e.g. `phyloseq` objects), but we will be using classes defined in the `taxa` package here.
The goal of the `taxa` package is to provide an all-purpose standard way of manipulation any type of information assigned to a taxonomy.
`Taxa` provides a set of flexible `r gloss$add('parsing', shown = 'parsers')` that should be able to read nearly any format, given the correct settings.
You can read more about parsing taxonomic data with `taxa` here: https://github.com/ropensci/taxa#parsing-data.
The taxonomic data we appended to the abundance matrix has the following form: 

```{r}
head(otu_data$taxonomy, 10)
```

Note that there are some odd aspects to the format that could make it challenging to parse:

* Some taxa have `r gloss$add('Taxonomic ranks', shown = 'ranks')` (e.g. "k__Bacteria") and some don't (e.g. "Unassigned" and "Root"). 
* Some taxa have ranks, but no names (e.g. "f__").

If we just consider the ranks to be a part of the taxon name, then its pretty easy to parse:

```{r}
library(taxa)
obj <- parse_tax_data(otu_data,
                      class_cols = "taxonomy", # The column in the input table
                      class_sep = ";") # What each taxon is seperated by
print(obj)
```

Above is the print out of a `taxmap` object.
The first line tells us that the OTUs are assigned to `r format(length(obj$taxon_names()), big.mark = ",")` unique taxa and lists their IDs and names.
These taxon IDs were generated automatically when converting to the `taxmap` format and were not in the original data set.
The second line describes how taxa relate to each other in the tree.
Note how our original data are now inside this object: 

```{r}
print(obj$data$tax_data)
```

`obj$data` is a list of arbitrary, user-defined data sets.
These data sets can be named anything and can be any R object, such as `r gloss$add('list', shown = 'lists')`, or `r gloss$add('vector', shown = 'vectors')`, or tables.
This is different than `phyloseq` objects, which have a fixed number of data sets in pre-defined formats, since the focus of `taxa` is taxonomic data in general and the focus of `phyloseq` is microbiome data in particular.
Note that our data set now has a "taxon_id" column, which associates rows in the table to a taxa in the taxonomy.
This column is essential for the manipulation functions of `taxa` to know how to handle these data sets, as we will demonstrate later.

If we want to split out the rank information while parsing, we can use `r gloss$add('regular expressions')` (a.k.a "regex") to specify which part of each taxon is a rank and which part is a name.
If you are not familiar with using regular expressions, this might be challenging to understand at first, but it is a very useful skill to have, so it is worth learning.
Most regular expressions are composed of a series of "what to match" followed by "how many times to match".
One regular expression that matches the pattern of taxon names is `^[a-z]{0,1}_{0,2}.*$`.
This might look intimidating, but we can break it down into understandable parts: 

* The `^` and `$` represent the start and end of the text respectively. If these were not there, then the pattern could match just a part of the text.
* The square brackets (e.g. `[a-z]`) specify a range of characters that can be matched. Likewise the `.` means match any character.
* The contents of the curly braces (e.g. `{0,1}`) indicate the number of time the preceding pattern can match. Likewise, the `*` means 0 or more matches. For example the part of the regex `^[a-z]{0,1}` means "match a character `a` through `z` that occurs at the start of the string either zero or one times". 
* Any text that is not a special regex character (e.g., `[` and `.`) matches itself, so the `_` matches a `_` in the text. To match characters like `[` in the text you "escape" them with `\\` (e.g., `\\[`).

The whole regex means the following in common English: 

"From the start of the string, (`^`) match any character between "a" and "z" (`[a-z]`) zero or one times (`{0,1}`) followed by an underscore (`_`) occurring between zero and 2 times  (`{0,2}`), followed by any character (`.`) occurring zero or more times (`*`), followed by the end of the text (`$`)."

We can add parentheses that specify which parts of the pattern go together; these are called `r gloss$add('capture groups')` in regex jargon.
These do not change what will be matched; they just define different parts of the pattern.
In this case, we are interested in the taxon rank (matched by `([a-z]{0,1})`) and the taxon name (matched by `(.*)`).

The `parse_tax_data` function from the `taxa` package uses regular expressions with capture groups to isolate the pieces of information we want.
For each capture group in the regular expression (aka "regex"), a value is given to the `class_key` option specifying what the group is (e.g. taxon name).
Putting this all together, we can read the data like so:

```{r}
obj <- parse_tax_data(otu_data,
                      class_cols = "taxonomy",
                      class_sep = ";",
                      class_regex = "^([a-z]{0,1})_{0,2}(.*)$",
                      class_key = c("tax_rank" = "taxon_rank", "name" = "taxon_name"))
print(obj)
```

Note how the taxon names do not have the rank information anymore:

```{r}
head(taxon_names(obj))
```

Instead, the rank information (and any other capture group content) is in a separate data set:

```{r}
obj$data$class_data
```

However, ranks can also be accessed using the `taxon_ranks` function: 

```{r}
head(taxon_ranks(obj))
```

So we don't really need the "class_data" table, so lets get rid of it:

```{r}
obj$data$class_data <- NULL
```

Lets also rename the "tax_data" table to something more informative:

```{r}
names(obj$data) <- "otu_counts"
print(obj)
```

```{r echo = FALSE}
save(obj, sample_data, file = "parsed_data.Rdata")
```

We can name the tables or other info in `obj$data` whatever we want.
`obj$data` is a standard `list`, which means any number of things of any type can be put in it.

## Exercises

### Reading tabular data

**1)** Take a look at the file [example_data_1.tsv](data/example_data_1.tsv) (click to download).

**2)** Try reading the file using the base R function `read.table`. You might need to change some of the options. Type `?read.table` to see the documentation for this function. Hint: what does the `.tsv` extension mean?

```{r hide_button = TRUE}
table_1 <- read.table("data/example_data_1.tsv", header = TRUE, sep = "\t")
```

**3)** Now try reading the same file using the `read_tsv` function from the `readr` package. Type `?read_tsv` to see the documentation for this function.

```{r hide_button = TRUE}
library(readr)
table_2 <- read_tsv("data/example_data_1.tsv")
```

**4)** Compare the two results. What is different about them? Try using the `str` function on each result to see the details about how they are formatted.

```{r hide_button = TRUE}
str(table_1)
str(table_2)
```

**5)** Try to change the input parameters of both functions so that all columns are of type `character` ("chr" in the `str` output).

```{r hide_button = TRUE}
table_1 <- read.table("data/example_data_1.tsv",
                      header = TRUE, sep = "\t", colClasses = "character")
table_2 <- read_tsv("data/example_data_1.tsv", col_types = "cccc")
str(table_1)
str(table_2)
```

### Reading taxonomic data

In this exercise we will be converting the table from the previous exercise to a `taxmap` object.

```{r}
my_data <- read_tsv("data/example_data_1.tsv")
print(my_data)
```

**6)** Look at the documentation for `parse_tax_data` and `lookup_tax_data`. The examples at the bottom of the documentation should be helpful.

```{r hide_button = TRUE, eval = FALSE}
library(taxa)
?parse_tax_data
?lookup_tax_data
```

**7)** Try using `parse_tax_data` to convert the table to a `taxmap` object, using the `my_taxonomy` column for the taxonomic information. You should get a taxonomy with "mammalia" at the root. How does the table included in the output differ from the input table?

```{r hide_button = TRUE}
parse_tax_data(my_data, class_cols = "my_taxonomy", class_sep = ", ")
```

**8)** Try using `lookup_tax_data` to convert the table to a `taxmap` object, using the `ncbi_seq_id` column to look up the taxonomy associated with these Genbank accession numbers. NOTE: This requires an internet connection.

```{r hide_button = TRUE, message=FALSE}
lookup_tax_data(my_data, type = "seq_id", column = "ncbi_seq_id", database = "ncbi")
```

**9)** Try using `lookup_tax_data` to convert the table to a `taxmap` object, using the `itis_taxon_id` column to look up the taxonomic classifications from the Integrated Taxonomic Information System (ITIS) for these taxon IDs. NOTE: This requires an internet connection.

```{r hide_button = TRUE, message=FALSE}
lookup_tax_data(my_data, type = "taxon_id", column = "itis_taxon_id", database = "itis")
```

**10)** Compare the results of the three sources of taxonomic information. What is different? What is the same? 

```{r hide_button = "Show Answer", results = 'asis', echo = FALSE}
cat(
"* The number and name of taxa are different since each source has a different taxonomy.
* The taxon IDs are different. When an online database is queried, the taxon IDs from that database are used.
* The input data stored in 'tax_data' is the same, except for the taxon ID column."
)
```

### Reading taxonomic data from complex formats

Sometimes taxonomic data can be embedded in complex text, like FASTA headers.
Look at the file [example_data_2.fa](data/example_data_2.fa) (click to download) which contains headers with two sources of taxonomic information:

* The Genbank accession number
* The taxonomic classification

**11)** Read the FASTA file [example_data_2.fa](data/example_data_2.fa) using the `read.FASTA` function from the `ape` package and store the result in a variable.

```{r hide_button = TRUE}
library(ape)
seqs <- read.FASTA("data/example_data_2.fa")
print(seqs)
```

**12)** Use the `names` function to get the headers and stores those in another variable.

```{r hide_button = TRUE}
headers <- names(seqs)
print(headers)
```

**13)** Look at the documentation for `extract_tax_data` from the `taxa` package by typing `?extract_tax_data`.

**14)** Use the `extract_tax_data` function to convert the headers to a `taxmap` object, using the NCBI accession number. The classification can be ignored (i.e. not given a capture group and key value) or stored as `"info"`.

```{r hide_button = TRUE, message = FALSE}
extract_tax_data(headers,
                 key = c(my_acc_no = "seq_id", my_class = "info"),
                 regex = "^(.+)::(.+)$")
```

**15)** Now use the classification instead of the accession numbers. The accession number can be ignored or stored as `"info"`. Consider the rank as part of the taxon name for now.

```{r hide_button = TRUE}
extract_tax_data(headers,
                 key = c(my_acc_no = "info", my_class = "class"),
                 regex = "^(.+)::(.+)$",
                 class_sep = ";")
```

**16)** Now try to split out the rank information from the taxon name. You will need to use the `class_key` and `class_regex` options. The ranks can be stored as `"info"` or `"taxon_rank`". Using `"taxon_rank`" allows the ranks to be accessed with the `taxon_ranks` function.

```{r hide_button = TRUE}
extract_tax_data(headers,
                 key = c(acc_no = "info", my_class = "class"),
                 regex = "^(.+)::(.+)$",
                 class_sep = ";",
                 class_key = c(tax_rank = "info", name = "taxon_name"),
                 class_regex = "^(.+)_(.+)$")
```

**17)** Sometimes you will have the option to use more than one source of taxonomic information, like in this exercise. What are the benefits of using the embedded classification information instead of a Genbank accession number or taxon ID?

```{r hide_button = "Show Answer", results = 'asis', echo = FALSE}
cat(
  "Using the included classification does not require communicating with online databases, so is much faster and more reliable."
)
```


## References