RNA-Seq Header Section #216

Merged
merged 28 commits into from
Sep 18, 2020
Changes from 9 commits
Commits
bf8d000
Get it started
cansavvy Sep 10, 2020
c278547
I put words down
cansavvy Sep 10, 2020
3d1ddef
moar words and links
cansavvy Sep 10, 2020
8e31eed
More words and citations
cansavvy Sep 15, 2020
29b73ac
RNA-seq header section a bit more polished
cansavvy Sep 15, 2020
a0663af
add figure in
cansavvy Sep 15, 2020
2fa1990
Few tiny edits
cansavvy Sep 15, 2020
19fb0ac
Incorporate @cbethell review
cansavvy Sep 16, 2020
e86cb52
Fix one little wording change
cansavvy Sep 16, 2020
a63f03e
Put a TODO for that one link
cansavvy Sep 17, 2020
7e3950a
Incorporate most of the comments in Jackie's review
cansavvy Sep 17, 2020
493f749
Re-render
cansavvy Sep 17, 2020
c3e058e
Re-render after fixing references.bib
cansavvy Sep 17, 2020
5ed3afd
More wording changes
cansavvy Sep 17, 2020
c93c3c9
Doctoc and re-render
cansavvy Sep 17, 2020
08eae92
rearrange wording about normalization
cansavvy Sep 17, 2020
f024ffd
Re-render
cansavvy Sep 17, 2020
4778a09
A few more minor edits
cansavvy Sep 17, 2020
bf9a12e
Just a few more wording edits and sentence rearrangments
cansavvy Sep 17, 2020
be1161b
Merge branch 'master' into cansavvy/rna-seq-header
cansavvy Sep 17, 2020
8904ffc
Merge origin/master into cansavvy/rna-seq-header
jaclyn-taroni Sep 18, 2020
779779d
Alphabetical order after resolving conflicts
jaclyn-taroni Sep 18, 2020
ddfcef8
Merge branch 'master' into cansavvy/rna-seq-header
jaclyn-taroni Sep 18, 2020
5a0fb1c
Few smaller changes and rerender everything
cansavvy Sep 18, 2020
56e77e1
Add links to the RNA-seq header section
cansavvy Sep 18, 2020
ab0c39a
Merge branch 'master' into cansavvy/rna-seq-header
cansavvy Sep 18, 2020
f176ef0
Get rid of the one typo Jackie found
cansavvy Sep 18, 2020
bea4e3f
Merge remote-tracking branch 'origin/cansavvy/rna-seq-header' into ca…
cansavvy Sep 18, 2020
2 changes: 1 addition & 1 deletion 02-microarray/differential-expression_microarray_01.html

Large diffs are not rendered by default.

122 changes: 122 additions & 0 deletions 03-rnaseq/00-intro-to-rnaseq.Rmd
@@ -6,3 +6,125 @@ output:
toc: true
toc_float: true
---

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*

- [Introduction to RNA-seq technology](#introduction-to-rna-seq-technology)
- [RNA-seq data **strengths**:](#rna-seq-data-strengths)
- [RNA-seq data **limitations**:](#rna-seq-data-limitations)
- [About quantile normalization](#about-quantile-normalization)
- [More resources on RNA-seq technology:](#more-resources-on-rna-seq-technology)
- [About DESeq2](#about-deseq2)
- [DESeq2 objects](#deseq2-objects)
- [DESeq2 normalization methods](#deseq2-normalization-methods)
- [Further resources for DESeq2](#further-resources-for-deseq2)
- [Why doesn't the gene I care about show up in this dataset?](#why-doesnt-the-gene-i-care-about-show-up-in-this-dataset)
- [Why do these examples use DESeq2 and not edgeR or ____](#why-do-these-examples-use-deseq2-and-not-edger-or-____)
- [What if I care about isoforms?](#what-if-i-care-about-isoforms)
- [References](#references)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Introduction to RNA-seq technology

Data analyses are generally not "one size fits all"; this is particularly true when working with different technologies.
This tutorial has example analyses [organized by technology](../01-getting-started/getting-started.html#about-how-this-tutorial-book-is-structured) so you can follow examples that are more closely tailored to the nature of the data at hand.

As with all experimental methods, RNA-seq has strengths and limitations that you should consider with regard to your scientific questions.

### RNA-seq data **strengths**:

- RNA-seq can collect data on more transcripts (it is less bound to a pre-determined set of probes than microarrays are).
- Its values are considered more dynamic than microarray values, which are constrained to the number of probes.
Member

Do you mean the dynamic range of values? Do you have a citation for microarray values which are constrained to the number of probes?


### RNA-seq data **limitations**:

The steps involved in RNA sequencing introduce several different kinds of bias:

- **GC bias**: higher GC content sequences are less likely to be observed.
- **3' bias (positional bias)**: for most sequencing methods, the 3' ends of transcripts are more likely to be observed.
- **Complexity bias**: some sequences are more easily bound and amplified than others.
- **Library size or sequencing depth**: the total number of reads is not always equivalent between samples.
- **Gene length**: longer genes are more likely to be observed.
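
To make the library size point concrete, here is a minimal base R sketch. The counts are made-up toy values (not refine.bio data), and scaling to counts per million is only one simple way to account for sequencing depth; it is shown for illustration and is not what refine.bio or DESeq2 does.

```r
# Toy raw counts (made-up values): 3 genes (rows) x 2 samples (columns)
counts <- matrix(
  c(100, 200,
    300, 600,
    600, 1200),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("gene_a", "gene_b", "gene_c"),
    c("sample_1", "sample_2")
  )
)

# Library size (sequencing depth) is the total number of reads per sample
library_sizes <- colSums(counts)
library_sizes

# Scale each sample to counts per million (CPM) so depth no longer dominates
cpm <- t(t(counts) / library_sizes) * 1e6
cpm
```

In the raw counts, every gene looks twice as abundant in `sample_2`, but that difference comes entirely from `sample_2` having twice as many total reads.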

This figure from @Love2016 briefly summarizes some of these biases.

<img src="https://github.com/AlexsLemonade/refinebio-examples/raw/8e31eed8a70f6653dd263fc689bc230050a2b22d/components/figures/Love2016-fig1.png" width=700>

Most normalization methods, including [refine.bio's processing methods](http://docs.refine.bio/en/latest/main_text.html#rna-seq-pipelines), attempt to mitigate these biases, but they can never be fully eliminated.
Our refine.bio processing methods address some of these biases to the extent possible, so you don't have to worry too much about them.
In brief, refine.bio data is quantified by Salmon using its bias correction options: [`--seqBias`](https://salmon.readthedocs.io/en/latest/salmon.html#seqbias), [`--gcBias`](https://salmon.readthedocs.io/en/latest/salmon.html#gcbias), and [`--posBias`](https://salmon.readthedocs.io/en/latest/salmon.html#posbias).

### About quantile normalization

Data from refine.bio is available [quantile normalized](https://en.wikipedia.org/wiki/Quantile_normalization), which can address some library size biases.
More often than not, though, our example modules recommend downloading non-quantile normalized data (this option is specific to RNA-seq; microarray data does not have it).

<img src="https://github.com/AlexsLemonade/refinebio-examples/raw/e140face75daa6d2c34e30a4755c362e6039a677/template/screenshots/skip-quantile-normalization.png" width=500>

See here for more about the [quantile normalization process in refine.bio](http://docs.refine.bio/en/latest/main_text.html#quantile-normalization).
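
If quantile normalization is new to you, the following base R sketch shows the general idea on a tiny made-up matrix. The `quantile_normalize()` helper is hypothetical and written only for this illustration; it is not the code refine.bio runs, and it breaks ties by order of appearance, which is a simplification.

```r
# Hypothetical helper: quantile normalize the columns (samples) of a matrix
quantile_normalize <- function(mat) {
  # Rank of each value within its own sample
  ranks <- apply(mat, 2, rank, ties.method = "first")
  # Sort each sample, then average across samples at each rank
  sorted <- apply(mat, 2, sort)
  rank_means <- unname(rowMeans(sorted))
  # Replace every value with the across-sample mean for its rank
  normalized <- apply(ranks, 2, function(r) rank_means[r])
  dimnames(normalized) <- dimnames(mat)
  normalized
}

# Toy expression values (made up): 4 genes x 3 samples
expr <- matrix(
  c(5, 4, 3,
    2, 1, 4,
    3, 4, 6,
    4, 2, 8),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene_", 1:4), paste0("sample_", 1:3))
)

qn <- quantile_normalize(expr)

# After normalization, every sample contains the same set of values
apply(qn, 2, sort)
```

The design choice to notice is that each sample ends up with an identical distribution of values; only the assignment of values to genes differs between samples.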

### More resources on RNA-seq technology:

- [StatQuest: A gentle introduction to RNA-seq](https://www.youtube.com/watch?v=tlf6wYJrwKY) [@Starmer2017-rnaseq].
- [A general background on the wet lab methods of RNA-seq](https://bitesizebio.com/13542/what-everyone-should-know-about-rna-seq/) [@Hadfield2016].
- [Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5143225/) [@Love2016].
- [Biases in Illumina transcriptome sequencing caused by random hexamer priming](https://pdfs.semanticscholar.org/9d16/997f5de72d6c606fef3d673db70e5d1d8e1e.pdf?_ga=2.131436679.965169313.1600175795-124991789.1600175795) [@Hansen2010].
- [Computation for ChIP-seq and RNA-seq studies](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4121056/) [@Pepke2009].

## About DESeq2

DESeq2 is an R package for normalizing and analyzing RNA-seq data [@Love2014].
Our refine.bio data is summarized to the gene level with tximport before you download it [@Soneson2015].
In general, our examples suggest you download the data [non-quantile normalized](#about-quantile-normalization) so you can instead normalize it with DESeq2, which requires that you provide counts and *not* a normalized or corrected value like [TPMs](https://www.youtube.com/watch?v=TTUrtCY2k-w).
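
To give a feel for why DESeq2 wants counts, here is a base R sketch of the median-of-ratios idea behind its library size normalization. The counts are made-up toy values and the code is a simplified illustration, not DESeq2's implementation; in practice you would let DESeq2 do this for you (for example with `estimateSizeFactors()`).

```r
# Toy raw counts (made-up values): sample_2 was sequenced twice as deeply
counts <- matrix(
  c(10, 20,
    50, 100,
    200, 400,
    0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene_", 1:4), c("sample_1", "sample_2"))
)

# Per-gene geometric mean across samples, computed on the log scale
log_geo_means <- rowMeans(log(counts))

# Genes with a zero count in any sample get -Inf here and are skipped
usable <- is.finite(log_geo_means)

# Size factor per sample: median ratio of its counts to the geometric means
size_factors <- apply(counts, 2, function(sample_counts) {
  exp(median(log(sample_counts[usable]) - log_geo_means[usable]))
})
size_factors

# Dividing by the size factors puts the samples on a comparable scale
normalized_counts <- t(t(counts) / size_factors)
normalized_counts
```

Because the size factors are estimated from the counts themselves, values that have already been rescaled (like TPMs) would not carry the information this step relies on.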

### DESeq2 objects

Many Bioconductor packages expect your data to be formatted as a specialized object type.
For DESeq2, before we can use many of its functions, we need to get our data into a [`DESeqDataSet` object](https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class).
`DESeqDataSet` objects store not only your data but also transformations of your data, model information, and so on.

From our refine.bio datasets, we will use a function `DESeqDataSetFromMatrix()` to create our [`DESeqDataSet` objects](https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class).
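
As a rough sketch of what that looks like, the example below builds a `DESeqDataSet` from simulated counts. The count matrix, the `treatment` column, and the sample names are all made up for illustration; with a real refine.bio download you would substitute your own count matrix and metadata. It assumes the DESeq2 package is installed.

```r
library(DESeq2)

set.seed(12345)

# Simulated raw counts (made up): 1000 genes x 6 samples
counts_mat <- matrix(
  rnbinom(1000 * 6, mu = 100, size = 2),
  nrow = 1000,
  dimnames = list(paste0("gene_", 1:1000), paste0("sample_", 1:6))
)

# Hypothetical metadata; row names match the count matrix column names
metadata <- data.frame(
  treatment = factor(rep(c("control", "treated"), each = 3)),
  row.names = colnames(counts_mat)
)

# Bundle counts, sample metadata, and the model design into one object
dds <- DESeqDataSetFromMatrix(
  countData = counts_mat,
  colData = metadata,
  design = ~treatment
)
dds
```

Keeping the counts and the sample metadata in one object is what lets later DESeq2 functions find the design information without you passing it again.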

### DESeq2 normalization methods

Although DESeq2 has multiple normalization methods, we generally stick to `vst()` (Variance Stabilizing Transformation) or `rlog()`.
Member

I would probably call these transformations, not normalization. You can normalize (e.g., adjust for size factors; counts(<dataset>, normalize = TRUE)) without transforming. This could be confusing for someone coming in with some level of experience. Also should talk about what these are specifically doing beyond that normalization.

Member

This comment is more general and probably should be applied in other RNA-seq notebooks (e.g., 03-rnaseq/dimension_reduction_rnaseq_01_pca.Rmd), too.

Contributor Author

Filed: #220

Both methods are very similar and correct for library size differences, but `rlog()` sometimes takes a bit longer to run.
If you end up using a larger dataset and `rlog()` normalization takes too long, you can switch to `vst()` with confidence, since the two have been shown to yield similar results when the dataset is large enough.
For more about DESeq2 normalization, we highly recommend this [StatQuest video](https://www.youtube.com/watch?v=UFB993xufUU), which explains it quite nicely [@Starmer2017-deseq2].
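
The sketch below runs both `vst()` and `rlog()` on a simulated dataset so you can compare them yourself. It assumes the DESeq2 package is installed and uses `makeExampleDESeqDataSet()`, which generates made-up counts purely for illustration; the dataset size chosen here is arbitrary.

```r
library(DESeq2)

set.seed(12345)

# Simulated dataset generated by DESeq2, used here only for illustration
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)

# blind = TRUE ignores the experimental design during the transformation
vsd <- vst(dds, blind = TRUE)
rld <- rlog(dds, blind = TRUE)

# Pull the transformed values out as ordinary matrices
vst_values <- assay(vsd)
rlog_values <- assay(rld)

# One quick way to see how closely the two agree on this dataset
cor(as.vector(vst_values), as.vector(rlog_values))
```

On your own data, wrapping each call in `system.time()` is a simple way to check whether `rlog()` is slow enough to make `vst()` the more practical choice.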

### Further resources for DESeq2

- [StatQuest: RPKM, FPKM and TPM, Clearly Explained!!!](https://www.youtube.com/watch?v=TTUrtCY2k-w) [@Starmer2015].
- [StatQuest: DESeq2, part 1, Library Normalization](https://www.youtube.com/watch?v=UFB993xufUU) [@Starmer2017-deseq2].
- [DESeq2 vignette: Analyzing RNA-seq data with DESeq2](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) [@Love2020].
- [Beginner's guide to DESeq2](https://bioc.ism.ac.jp/packages/2.14/bioc/vignettes/DESeq2/inst/doc/beginner.pdf) [@Love2014-guide].
- [Introduction to DGE - using DESeq2](https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html) [@dge-workshop].

#### Why doesn't the gene I care about show up in this dataset?

There can be many reasons a particular gene does not show up in your refine.bio RNA-seq dataset. Here are just a few:

- The gene was not actually expressed in the cells in the first place.
- The gene was not expressed at high enough levels to be detectable.
- The gene does not have an Ensembl ID according to the [version of the annotation](TODO: Put link to refine.bio docs FAQ) we used.
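
A quick way to tell these situations apart is to check whether the gene's Ensembl ID is in your data at all and, if it is, whether any reads were observed for it. The sketch below uses a made-up matrix; the object names are hypothetical and the last gene ID is a placeholder that deliberately does not appear in the matrix.

```r
# Toy expression matrix (made-up values): rows are Ensembl gene IDs
expression_mat <- matrix(
  c(12, 30, 25,
    0, 0, 0,
    5, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("ENSG00000141510", "ENSG00000012048", "ENSG00000139618"),
    c("sample_1", "sample_2", "sample_3")
  )
)

# Hypothetical genes of interest; the last ID is a placeholder
genes_of_interest <- c("ENSG00000141510", "ENSG00000012048", "ENSG00000000000")

# Is each gene present in the dataset at all?
is_present <- genes_of_interest %in% rownames(expression_mat)

# For genes that are present, how many reads were observed in total?
present_genes <- genes_of_interest[is_present]
total_counts <- rowSums(expression_mat[present_genes, , drop = FALSE])

data.frame(
  gene = genes_of_interest,
  present = is_present,
  total_count = ifelse(is_present, total_counts[genes_of_interest], NA)
)
```

A gene that is present with a total of zero points toward the first two reasons, while a gene that is absent altogether is more consistent with an annotation difference.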

#### Why do these examples use DESeq2 and not edgeR or ____

In short, both edgeR and DESeq2 are good options, and we at the CCDL just went with one of our preferences!
See [this blog post comparing the two by one of the creators of DESeq2](https://mikelove.wordpress.com/2016/09/28/deseq2-or-edger/); he agrees edgeR is also great.
But we needed to pick something. Our [refine.bio data is processed by tximport](http://docs.refine.bio/en/latest/main_text.html#processing-information), another package from the creators of DESeq2, and these packages were designed to play well together. [We also really like their documentation and tutorials](#further-resources-for-deseq2).
But in the end, we are human and have our own preferences.

If you have a strong preference for edgeR, you can definitely use your refine.bio data with it, but we currently do not have examples of that.
In that case, we'd refer you to the [edgeR user's guide](http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) and wish you the best of luck on your data adventures!

#### What if I care about isoforms?

Unfortunately, at this time all refine.bio data is summarized to the gene level, and there's no great way to examine isoforms with this data.
If your research requires transcript isoform information, you may need to look elsewhere.
This [paper discusses some tools for these kinds of questions](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4002-1) [@Zhang2017].

## References
156 changes: 155 additions & 1 deletion 03-rnaseq/00-intro-to-rnaseq.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion 03-rnaseq/differential-expression_rnaseq_01.html

Large diffs are not rendered by default.

Binary file added components/figures/Love2016-fig1.png
107 changes: 107 additions & 0 deletions references.bib
@@ -56,6 +56,31 @@ @Manual{Gu2020
url = {https://jokergoo.github.io/ComplexHeatmap-reference/book/},
}

@Website{Hadfield2016,
title = {An Introduction to RNA-seq},
author = {{James Hadfield}},
month = {July},
year = {2016},
url = {https://bitesizebio.com/13542/what-everyone-should-know-about-rna-seq/},
}

@Article{Hansen2010,
Author="Hansen, K. D. and Brenner, S. E. and Dudoit, S. ",
Title="{{B}iases in {I}llumina transcriptome sequencing caused by random hexamer priming}",
Journal="Nucleic Acids Res.",
Year="2010",
Volume="38",
Number="12",
Pages="e131",
Month="Jul"
}

@Website{dge-workshop,
title = {Introduction to DGE},
author = {{Harvard Chan Bioinformatics Core (HBC)}},
url = {https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html},
}

@Article{Huber2015,
Author="Huber, W. and Carey, V. J. and Gentleman, R. and Anders, S. and Carlson, M. and Carvalho, B. S. and Bravo, H. C. and Davis, S. and Gatto, L. and Girke, T. and Gottardo, R. and Hahne, F. and Hansen, K. D. and Irizarry, R. A. and Lawrence, M. and Love, M. I. and MacDonald, J. and Obenchain, V. and Oleś, A. K. and Pagès, H. and Reyes, A. and Shannon, P. and Smyth, G. K. and Tenenbaum, D. and Waldron, L. and Morgan, M. ",
Title="{{O}rchestrating high-throughput genomic analysis with {B}ioconductor}",
@@ -92,6 +117,32 @@ @Article{Love2014
doi = {10.1186/s13059-014-0550-8},
}

@Article{Love2014-guide,
title = {Beginner’s guide to using the DESeq2 package},
author = {Michael I. Love and Simon Anders and Wolfgang Huber},
year = {2014},
url = {https://bioc.ism.ac.jp/packages/2.14/bioc/vignettes/DESeq2/inst/doc/beginner.pdf},
}

@Article{Love2016,
Author="Love, M. I. and Hogenesch, J. B. and Irizarry, R. A. ",
Title="{{M}odeling of {R}{N}{A}-seq fragment sequence bias reduces systematic errors in transcript abundance estimation}",
Journal="Nat. Biotechnol.",
Year="2016",
Volume="34",
Number="12",
Pages="1287--1291",
Month="Dec"
}

@Website{Love2020,
title = {Analyzing RNA-seq data with DESeq2},
author = {{Michael I. Love, Simon Anders, and Wolfgang Huber}},
month = {May},
year = {2020},
url = {https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html},
}

@Article{Nguyen2019,
title = {Ten quick tips for effective dimensionality reduction},
author = {Lan Huong Nguyen and Susan Holmes},
@@ -109,6 +160,17 @@ @Website{pca-visually-explained
url = {https://setosa.io/ev/principal-component-analysis/},
}

@Article{Pepke2009,
Author="Pepke, S. and Wold, B. and Mortazavi, A. ",
Title="{{C}omputation for {C}h{I}{P}-seq and {R}{N}{A}-seq studies}",
Journal="Nat. Methods",
Year="2009",
Volume="6",
Number="11 Suppl",
Pages="22--32",
Month="Nov"
}

@Manual{Prabhakaran2016,
title = {The Complete ggplot2 Tutorial},
author = {Selva Prabhakaran},
@@ -263,6 +325,40 @@ @Article{Slowikowski2017
url = {https://slowkow.com/notes/pheatmap-tutorial/},
}

@Article{Soneson2015,
title = {Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences},
author = {Charlotte Soneson and Michael I. Love and Mark D. Robinson},
year = {2015},
journal = {F1000Research},
doi = {10.12688/f1000research.7563.1},
volume = {4},
issue = {1521},
}

@Website{Starmer2015,
title = {RPKM, FPKM and TPM, Clearly Explained!!!},
author = {{Josh Starmer}},
month = {July},
year = {2015},
url = {https://www.youtube.com/watch?v=TTUrtCY2k-w},
}

@Website{Starmer2017-deseq2,
title = {StatQuest: DESeq2, part 1, Library Normalization},
author = {{Josh Starmer}},
month = {March},
year = {2017},
url = {https://www.youtube.com/watch?v=UFB993xufUU},
}

@Website{Starmer2017-rnaseq,
title = {StatQuest: A gentle introduction to RNA-seq},
author = {{Josh Starmer}},
month = {August},
year = {2017},
url = {https://www.youtube.com/watch?v=tlf6wYJrwKY},
}

@Article{Tregnago2016,
Author="Tregnago, C. and Manara, E. and Zampini, M. and Bisio, V. and Borga, C. and Bresolin, S. and Aveic, S. and Germano, G. and Basso, G. and Pigazzi, M. ",
Title="{{C}{R}{E}{B} engages {C}/{E}{B}{P}δ to initiate leukemogenesis}",
@@ -300,6 +396,17 @@ @Manual{Wickham2020
url = {https://CRAN.R-project.org/package=devtools},
}

@Article{Zhang2017,
Author="Zhang, C. and Zhang, B. and Lin, L. L. and Zhao, S. ",
Title="{{E}valuation and comparison of computational tools for {R}{N}{A}-seq isoform quantification}",
Journal="BMC Genomics",
Year="2017",
Volume="18",
Number="1",
Pages="583",
Month="08"
}

@Article{Zhu2018,
title = {Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences},
author = {Anqi Zhu and Joseph G. Ibrahim and Michael I. Love},