AlexsLemonade · cansavvy · Sep 18, 2020 · Sep 10, 2020 · Sep 10, 2020 · Sep 10, 2020
diff --git a/02-microarray/differential-expression_microarray_01.html b/02-microarray/differential-expression_microarray_01.html
diff --git a/03-rnaseq/00-intro-to-rnaseq.Rmd b/03-rnaseq/00-intro-to-rnaseq.Rmd
@@ -6,3 +6,125 @@ output:
     toc: true
     toc_float: true
 ---
+
+<!-- START doctoc generated TOC please keep comment here to allow auto update -->
+<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
+**Table of Contents**  *generated with [DocToc](https://github.com/thlorenz/doctoc)*
+
+- [Introduction to RNA-seq technology](#introduction-to-rna-seq-technology)
+  - [RNA-seq data **strengths**:](#rna-seq-data-strengths)
+  - [RNA-seq data **limitations**:](#rna-seq-data-limitations)
+  - [About quantile normalization](#about-quantile-normalization)
+  - [More resources on RNA-seq technology:](#more-resources-on-rna-seq-technology)
+- [About DESeq2](#about-deseq2)
+  - [DESeq2 objects](#deseq2-objects)
+  - [DESeq2 normalization methods](#deseq2-normalization-methods)
+  - [Further resources for DESeq2](#further-resources-for-deseq2)
+    - [Why doesn't the gene I care about show up in this dataset?](#why-doesnt-the-gene-i-care-about-show-up-in-this-dataset)
+    - [Why do these examples use DESeq2 and not EdgeR or ____](#why-do-these-examples-use-deseq2-and-not-edger-or-____)
+    - [What if I care about isoforms?](#what-if-i-care-about-isoforms)
+- [References](#references)
+
+<!-- END doctoc generated TOC please keep comment here to allow auto update -->
+
+## Introduction to RNA-seq technology
+
+Data analyses are generally not "one size fits all"; this is particularly true when working with data from different technologies.
+This tutorial has example analyses [organized by technology](../01-getting-started/getting-started.html#about-how-this-tutorial-book-is-structured) so you can follow examples that are more closely tailored to the nature of the data at hand. 
+
+As with all experimental methods, RNA-seq has strengths and limitations that you should consider with regard to your scientific questions.
+
+### RNA-seq data **strengths**:  
+
+- RNA-seq can collect data on more transcripts (it is less bound to a pre-determined set of probes than microarray is).
+- Its values are considered to have a wider dynamic range than microarray values, which are constrained by the probes on the array.
+
+### RNA-seq data **limitations**:  
+
+The steps involved in RNA sequencing introduce several different kinds of biases:
+
+- **GC bias**: higher GC content sequences are less likely to be observed.  
+- **3' bias (positional bias)**: for most sequencing methods, the 3' ends of transcripts are more likely to be observed.  
+- **Complexity bias**: some sequences are more easily bound and amplified than others.  
+- **Library size or sequencing depth**: the total number of reads is not always equivalent between samples.  
+- **Gene length**: longer genes are more likely to be observed.   
+
+This figure from @Love2016 briefly summarizes some of these biases.   
+
+<img src="https://github.com/AlexsLemonade/refinebio-examples/raw/8e31eed8a70f6653dd263fc689bc230050a2b22d/components/figures/Love2016-fig1.png" width=700>
+
+Most normalization methods, including [refine.bio's processing methods](http://docs.refine.bio/en/latest/main_text.html#rna-seq-pipelines), attempt to mitigate these biases, but they can never be fully eliminated.
+Our refine.bio processing methods address some of these biases to the extent possible, so you don't have to worry too much about them.
+In brief, refine.bio data is quantified by Salmon using its bias-correction options: [`--seqBias`](https://salmon.readthedocs.io/en/latest/salmon.html#seqbias), [`--gcBias`](https://salmon.readthedocs.io/en/latest/salmon.html#gcbias), and [`--posBias`](https://salmon.readthedocs.io/en/latest/salmon.html#posbias).
+
+### About quantile normalization
+
+Refine.bio data is available [quantile normalized](https://en.wikipedia.org/wiki/Quantile_normalization), which can address some library size biases.
+However, more often than not, our example modules recommend downloading non-quantile normalized data instead (note that this option is RNA-seq specific; microarray data does not have it).
+
+<img src="https://github.com/AlexsLemonade/refinebio-examples/raw/e140face75daa6d2c34e30a4755c362e6039a677/template/screenshots/skip-quantile-normalization.png" width=500>
+
+See here for more about the [quantile normalization process in refine.bio](http://docs.refine.bio/en/latest/main_text.html#quantile-normalization).
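+
+To make the idea concrete, here is a minimal base R sketch of textbook quantile normalization on a made-up matrix.
+The function name `quantile_normalize()` and the toy values are assumptions for illustration only; this is not refine.bio's own implementation, which is described in the docs linked above.
+
+```r
+# Hypothetical helper: textbook quantile normalization of a
+# gene (rows) x sample (columns) matrix with no missing values
+quantile_normalize <- function(mat) {
+  # Rank the values within each sample; ties are broken by order of appearance
+  ranks <- apply(mat, 2, rank, ties.method = "first")
+  # Sort each sample, then average across samples at each rank
+  sorted <- apply(mat, 2, sort)
+  rank_means <- rowMeans(sorted)
+  # Replace each value with the mean that corresponds to its rank
+  normalized <- apply(ranks, 2, function(sample_ranks) rank_means[sample_ranks])
+  dimnames(normalized) <- dimnames(mat)
+  normalized
+}
+
+# Made-up expression values: 4 genes x 3 samples
+toy_matrix <- matrix(
+  c(5, 2, 3, 4,
+    4, 1, 4, 2,
+    3, 4, 6, 8),
+  nrow = 4,
+  dimnames = list(paste0("gene_", 1:4), paste0("sample_", 1:3))
+)
+
+normalized_matrix <- quantile_normalize(toy_matrix)
+
+# After normalization, every sample contains the same set of values
+apply(normalized_matrix, 2, sort)
+```
+
+Because every sample is forced onto the same distribution, differences in overall library size between samples are removed along with any other differences in distribution shape.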
+
+### More resources on RNA-seq technology: 
+
+- [StatQuest: A gentle introduction to RNA-seq](https://www.youtube.com/watch?v=tlf6wYJrwKY) [@Starmer2017-rnaseq].
+- [A general background on the wet lab methods of RNA-seq](https://bitesizebio.com/13542/what-everyone-should-know-about-rna-seq/) [@Hadfield2016].
+- [Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5143225/) [@Love2016].
+- [Biases in Illumina transcriptome sequencing caused by random hexamer priming](https://pdfs.semanticscholar.org/9d16/997f5de72d6c606fef3d673db70e5d1d8e1e.pdf?_ga=2.131436679.965169313.1600175795-124991789.1600175795) [@Hansen2010].
+- [Computation for ChIP-seq and RNA-seq studies](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4121056/) [@Pepke2009].
+
+## About DESeq2
+
+DESeq2 is an R package for normalizing and analyzing RNA-seq data [@Love2014].
+Our refine.bio data is summarized to the gene level with tximport before you download it [@Soneson2015].
+In general, our examples suggest you download the data [non-quantile normalized](#about-quantile-normalization) so you can instead normalize the data with DESeq2, which requires that you provide counts and *not* a normalized or corrected value like [TPMs](https://www.youtube.com/watch?v=TTUrtCY2k-w).
+
+### DESeq2 objects
+
+Many R Bioconductor packages have specialized object types that they expect your data to be formatted as.
+For DESeq2, before we can use many of its special functions, we need to get our data into a [`DESeqDataSet` object](https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class).
+`DESeqDataSet` objects store not only your data, but also additional transformations of your data, model information, etc.
+
+From our refine.bio datasets, we will use the `DESeqDataSetFromMatrix()` function to create our [`DESeqDataSet` objects](https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class).
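+
+As a rough sketch of what this step can look like, the example below builds a `DESeqDataSet` from simulated counts.
+The toy matrix, the `treatment` column, and the `~treatment` design are hypothetical stand-ins; with real refine.bio data you would substitute your downloaded expression matrix and sample metadata.
+
+```r
+library(DESeq2)
+
+set.seed(12345)
+
+# Hypothetical toy data: 100 genes (rows) by 6 samples (columns)
+toy_counts <- matrix(
+  rnbinom(600, mu = 100, size = 2),
+  nrow = 100,
+  dimnames = list(paste0("gene_", 1:100), paste0("sample_", 1:6))
+)
+
+# Hypothetical metadata: one row per sample, in the same order as the columns
+toy_metadata <- data.frame(
+  treatment = factor(rep(c("control", "treated"), each = 3)),
+  row.names = colnames(toy_counts)
+)
+
+# DESeq2 expects non-negative integers, so round any non-integer values first
+toy_counts <- round(toy_counts)
+
+# Bundle the counts, the metadata, and the model design into one object
+dds <- DESeqDataSetFromMatrix(
+  countData = toy_counts,
+  colData = toy_metadata,
+  design = ~treatment
+)
+
+# Printing the object summarizes its dimensions and stored metadata
+dds
+```
+
+The columns of the count matrix and the rows of the metadata need to describe the same samples in the same order, which is why the sketch reuses the column names as row names.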
+
+### DESeq2 normalization methods
+
+Although DESeq2 has multiple normalization methods, we generally stick to `vst()` (variance stabilizing transformation) or `rlog()`.
+Both methods are very similar and correct for library size differences, but `rlog()` sometimes takes a bit longer to run.
+If you end up using a larger dataset and `rlog()` normalization takes too long, you can switch to `vst()` with confidence, since the two methods have been shown to yield similar results when the dataset is large enough.
+For more about DESeq2 normalization, we highly recommend this [StatQuest video](https://www.youtube.com/watch?v=UFB993xufUU), which explains it quite nicely [@Starmer2017-deseq2].
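+
+The sketch below shows how the two transformations might be run side by side on simulated data.
+It assumes DESeq2 is installed and uses its `makeExampleDESeqDataSet()` helper to generate a made-up dataset; the numbers of genes and samples are arbitrary choices for illustration.
+
+```r
+library(DESeq2)
+
+set.seed(12345)
+
+# Simulated example dataset generated by DESeq2 (2000 genes, 6 samples)
+dds <- makeExampleDESeqDataSet(n = 2000, m = 6)
+
+# Variance stabilizing transformation (usually the faster option)
+vsd <- vst(dds, blind = TRUE)
+
+# Regularized log transformation (can be slow for large datasets)
+rld <- rlog(dds, blind = TRUE)
+
+# Pull out the transformed values as ordinary matrices
+vst_values <- assay(vsd)
+rlog_values <- assay(rld)
+
+# Compare the two transformations for the first sample
+cor(vst_values[, 1], rlog_values[, 1])
+```
+
+Both functions return an object that holds the transformed values, and `assay()` extracts them as a gene by sample matrix for downstream plotting or clustering.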
+
+### Further resources for DESeq2
+
+- [StatQuest: RPKM, FPKM and TPM, Clearly Explained!!!](https://www.youtube.com/watch?v=TTUrtCY2k-w) [@Starmer2015].
+- [StatQuest: DESeq2, part 1, Library Normalization](https://www.youtube.com/watch?v=UFB993xufUU) [@Starmer2017-deseq2].
+- [DESeq2 vignette: Analyzing RNA-seq data with DESeq2](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) [@Love2020].
+- [Beginner's guide to DESeq2](https://bioc.ism.ac.jp/packages/2.14/bioc/vignettes/DESeq2/inst/doc/beginner.pdf) [@Love2014-guide].
+- [Introduction to DGE - using DESeq2](https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html) [@dge-workshop].
+
+#### Why doesn't the gene I care about show up in this dataset?
+
+There are many reasons a particular gene might not show up in your refine.bio RNA-seq dataset. Here are just a few:  
+
+  - The gene was not actually expressed in the cells in the first place.    
+  - The gene was not expressed at high enough levels to be detectable.  
+  - The gene does not have an Ensembl ID according to the [version of the annotation](TODO: Put link to refine.bio docs FAQ) we used.   
+
+#### Why do these examples use DESeq2 and not EdgeR or ____
+
+In short, both edgeR and DESeq2 are good options, and we at the CCDL just went with one of our preferences!
+[This blog post by one of the creators of DESeq2 summarizes the two packages](https://mikelove.wordpress.com/2016/09/28/deseq2-or-edger/); he agrees that edgeR is also great.
+But we needed to pick something.
+Our [refine.bio data is processed by tximport](http://docs.refine.bio/en/latest/main_text.html#processing-information), another package from the creators of DESeq2, and the two packages are designed to play well together.
+[We also really like their documentation and tutorials](#further-resources-for-deseq2).
+In the end, we are human and have our own preferences.
+
+If you have a strong preference for edgeR, you can definitely use your refine.bio data with it, but we currently do not have examples of that.
+In this case, we'd refer you to the [edgeR User's Guide](http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) and wish you the best of luck on your data adventures!
+
+#### What if I care about isoforms?
+
+Unfortunately, at this time all refine.bio data is summarized to the gene level, and there's no great way to examine isoforms with this data.
+If your research requires transcript isoform information, you may need to look elsewhere.
+This [paper discusses some tools for these kinds of questions](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4002-1) [@Zhang2017].
+
+## References
diff --git a/03-rnaseq/00-intro-to-rnaseq.html b/03-rnaseq/00-intro-to-rnaseq.html
diff --git a/03-rnaseq/differential-expression_rnaseq_01.html b/03-rnaseq/differential-expression_rnaseq_01.html
diff --git a/components/figures/Love2016-fig1.png b/components/figures/Love2016-fig1.png
diff --git a/references.bib b/references.bib
@@ -56,6 +56,31 @@ @Manual{Gu2020
   url = {https://jokergoo.github.io/ComplexHeatmap-reference/book/}, 
 }
 
+@Website{Hadfield2016,
+  title = {An Introduction to RNA-seq},
+  author = {{James Hadfield}}, 
+  month = {July},
+  year = {2016},
+  url = {https://bitesizebio.com/13542/what-everyone-should-know-about-rna-seq/},
+}
+
+@Article{Hansen2010,
+   Author="Hansen, K. D.  and Brenner, S. E.  and Dudoit, S. ",
+   Title="{{B}iases in {I}llumina transcriptome sequencing caused by random hexamer priming}",
+   Journal="Nucleic Acids Res.",
+   Year="2010",
+   Volume="38",
+   Number="12",
+   Pages="e131",
+   Month="Jul"
+}
+
+@Website{dge-workshop,
+  title = {Introduction to DGE},
+  author = {{Harvard Chan Bioinformatics Core (HBC)}}, 
+  url = {https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html},
+}
+
 @Article{Huber2015,
    Author="Huber, W.  and Carey, V. J.  and Gentleman, R.  and Anders, S.  and Carlson, M.  and Carvalho, B. S.  and Bravo, H. C.  and Davis, S.  and Gatto, L.  and Girke, T.  and Gottardo, R.  and Hahne, F.  and Hansen, K. D.  and Irizarry, R. A.  and Lawrence, M.  and Love, M. I.  and MacDonald, J.  and Obenchain, V.  and Oleś, A. K.  and Pagès, H.  and Reyes, A.  and Shannon, P.  and Smyth, G. K.  and Tenenbaum, D.  and Waldron, L.  and Morgan, M. ",
    Title="{{O}rchestrating high-throughput genomic analysis with {B}ioconductor}",
@@ -92,6 +117,32 @@ @Article{Love2014
   doi = {10.1186/s13059-014-0550-8},
 }
 
+@Article{Love2014-guide,
+  title = {Beginner’s guide to using the DESeq2 package},
+  author = {Michael I. Love and Simon Anders and Wolfgang Huber},
+  year = {2014},
+  url = {https://bioc.ism.ac.jp/packages/2.14/bioc/vignettes/DESeq2/inst/doc/beginner.pdf},
+}
+
+@Article{Love2016,
+   Author="Love, M. I.  and Hogenesch, J. B.  and Irizarry, R. A. ",
+   Title="{{M}odeling of {R}{N}{A}-seq fragment sequence bias reduces systematic errors in transcript abundance estimation}",
+   Journal="Nat. Biotechnol.",
+   Year="2016",
+   Volume="34",
+   Number="12",
+   Pages="1287--1291",
+   Month="Dec"
+}
+
+@Website{Love2020,
+  title = {Analyzing RNA-seq data with DESeq2},
+  author = {{Michael I. Love, Simon Anders, and Wolfgang Huber}}, 
+  month = {May},
+  year = {2020},
+  url = {https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html},
+}
+
 @Article{Nguyen2019,
   title = {Ten quick tips for effective dimensionality reduction},
   author = {Lan Huong Nguyen and Susan Holmes},
@@ -109,6 +160,17 @@ @Website{pca-visually-explained
   url = {https://setosa.io/ev/principal-component-analysis/},
 }
 
+@Article{Pepke2009,
+   Author="Pepke, S.  and Wold, B.  and Mortazavi, A. ",
+   Title="{{C}omputation for {C}h{I}{P}-seq and {R}{N}{A}-seq studies}",
+   Journal="Nat. Methods",
+   Year="2009",
+   Volume="6",
+   Number="11 Suppl",
+   Pages="22--32",
+   Month="Nov"
+}
+
 @Manual{Prabhakaran2016,
   title = {The Complete ggplot2 Tutorial},
   author = {Selva Prabhakaran},
@@ -263,6 +325,40 @@ @Article{Slowikowski2017
   url = {https://slowkow.com/notes/pheatmap-tutorial/},
 }
 
+@Article{Soneson2015,
+    title = {Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences},
+    author = {Charlotte Soneson and Michael I. Love and Mark D. Robinson},
+    year = {2015},
+    journal = {F1000Research},
+    doi = {10.12688/f1000research.7563.1},
+    volume = {4},
+    issue = {1521},
+  }
+
+@Website{Starmer2015,
+  title = {RPKM, FPKM and TPM, Clearly Explained!!!},
+  author = {{Josh Starmer}}, 
+  month = {July},
+  year = {2015},
+  url = {https://www.youtube.com/watch?v=TTUrtCY2k-w},
+}
+
+@Website{Starmer2017-deseq2,
+  title = {StatQuest: DESeq2, part 1, Library Normalization},
+  author = {{Josh Starmer}}, 
+  month = {March},
+  year = {2017},
+  url = {https://www.youtube.com/watch?v=UFB993xufUU},
+}
+
+@Website{Starmer2017-rnaseq,
+  title = {StatQuest: A gentle introduction to RNA-seq},
+  author = {{Josh Starmer}}, 
+  month = {August},
+  year = {2017},
+  url = {https://www.youtube.com/watch?v=tlf6wYJrwKY},
+}
+
 @Article{Tregnago2016,
    Author="Tregnago, C.  and Manara, E.  and Zampini, M.  and Bisio, V.  and Borga, C.  and Bresolin, S.  and Aveic, S.  and Germano, G.  and Basso, G.  and Pigazzi, M. ",
    Title="{{C}{R}{E}{B} engages {C}/{E}{B}{P}δ to initiate leukemogenesis}",
@@ -300,6 +396,17 @@ @Manual{Wickham2020
   url = {https://CRAN.R-project.org/package=devtools},
 }
 
+@Article{Zhang2017,
+   Author="Zhang, C.  and Zhang, B.  and Lin, L. L.  and Zhao, S. ",
+   Title="{{E}valuation and comparison of computational tools for {R}{N}{A}-seq isoform quantification}",
+   Journal="BMC Genomics",
+   Year="2017",
+   Volume="18",
+   Number="1",
+   Pages="583",
+   Month="08"
+}
+
 @Article{Zhu2018,
     title = {Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences},
     author = {Anqi Zhu and Joseph G. Ibrahim and Michael I. Love},