From 53a058ac1065c208716d7cc7062649f5e25ed048 Mon Sep 17 00:00:00 2001
From: Candace Savonen
Date: Thu, 1 Oct 2020 07:48:55 -0400
Subject: [PATCH 1/7] Add the obvious words to dictionary.txt

---
 components/dictionary.txt | 76 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/components/dictionary.txt b/components/dictionary.txt
index fbe49b63..cc44bbe5 100644
--- a/components/dictionary.txt
+++ b/components/dictionary.txt
@@ -1,11 +1,87 @@
+actin
+ADT
 al.
+AnaLysis
+Azacytidine
+Benjamini
+bioconductor
+CCDL
+CCDL's
+cheatsheet
+cheatsheets
+ChIP
+ColorBrewer
+ComplexHeatmap
+CREB
+cytosolic
+Danio
 dataset
 dataset's
 DESeq
 DESeq2
+directionality
+DocToc
+ENSDARG
+Ensembl
+ENSG
+Entrez
+et
 et.
+FACS
+GEne
+genesets
+generalizable
+ggplot
+GitHub
+glioma
+GSE
+GSEA
+hexamer
+HGNC
+histological
+Hochberg
+hypomethylating
+Illumina
+isoform
+isoforms
+jpeg
+KEGG
+limma
+medulloblastoma
+musculus
+MSigDB
+myeloid
+ortholog
+orthologs
+overexpressing
+overexpression
+pheatmap
+PLoS
 PNG
+probesets
+prostatectomy
 `.Rmd`
+QuSAGE
+README
+rerio
+ribosomes
+RPKMs
 RStudio
+SeT
+ssGSEA
+StatQuest
 tidyverse
+TPM
+TPMs
+transcriptional
+transcriptome
+transcriptomic
 TSV
+tximport
+UMAP
+unmapped
+upregulated
+WebGestalt
+WebGestaltR
+WNT
+zebrafish

From 8595f8a7589e0230f6968948b42523b4c249d6f2 Mon Sep 17 00:00:00 2001
From: Candace Savonen
Date: Thu, 1 Oct 2020 08:56:50 -0400
Subject: [PATCH 2/7] A bunch of spelling errors fixed and remove template from spell check

---
 .github/PULL_REQUEST_TEMPLATE.md | 2 +-
 .github/workflows/style-and-sp-check.yml | 2 +-
 01-getting-started/getting-started.Rmd | 8 +-
 01-getting-started/getting-started.html | 10 +-
 .../clustering_microarray_01_heatmap.Rmd | 54 +++----
 .../clustering_microarray_01_heatmap.html | 12 +-
 ...tial-expression_microarray_01_2-groups.Rmd | 6 +-
 ...ial-expression_microarray_01_2-groups.html | 24 +--
 ...xpression_microarray_02_several-groups.Rmd | 6 +-
 ...pression_microarray_02_several-groups.html | 22 +--
 .../dimension-reduction_microarray_01_pca.Rmd | 4 +-
 ...dimension-reduction_microarray_01_pca.html | 12 +-
 ...dimension-reduction_microarray_02_umap.Rmd | 4 +-
 ...imension-reduction_microarray_02_umap.html | 34 ++---
 ...ne-id-annotation_microarray_01_ensembl.Rmd | 6 +-
 ...e-id-annotation_microarray_01_ensembl.html | 14 +-
 .../ortholog_mapping_microarray_01.Rmd | 44 +++---
 .../pathway_analysis_microarray_00_intro.Rmd | 46 +++---
 ...is_microarray_01_ortholog_mapping_kegg.Rmd | 2 +-
 ...sis_microarray_02_ora_with_webgestaltr.Rmd | 46 +++---
 ...sis_microarray_03_qusage_meta_analysis.Rmd | 6 +-
 ...icroarray_04_qusage_replicate_vignette.Rmd | 2 +-
 ...is_microarray_05_qusage_single_dataset.Rmd | 52 +++----
 .../pathway_analysis_microarray_06_ssgsea.Rmd | 4 +-
 03-rnaseq/00-intro-to-rnaseq.Rmd | 6 +-
 03-rnaseq/00-intro-to-rnaseq.html | 6 +-
 03-rnaseq/clustering_rnaseq_01_heatmap.Rmd | 18 +--
 03-rnaseq/clustering_rnaseq_01_heatmap.html | 18 +--
 .../differential-expression_rnaseq_01.Rmd | 2 +-
 .../differential-expression_rnaseq_01.html | 14 +-
 .../dimension-reduction_rnaseq_01_pca.Rmd | 6 +-
 .../dimension-reduction_rnaseq_01_pca.html | 16 +-
 .../dimension-reduction_rnaseq_02_umap.Rmd | 48 +++---
 .../dimension-reduction_rnaseq_02_umap.html | 14 +-
 .../gene-id-annotation_rnaseq_01_ensembl.Rmd | 4 +-
 .../gene-id-annotation_rnaseq_01_ensembl.html | 12 +-
 ...ntile_normalize_own_data_adv_topics_01.Rmd | 2 +-
 ...ial_expression_adv_topics_00_author_de.Rmd | 28 ++--
 ..._differential_expression_adv_topics_01.Rmd | 143 +++++++++---------
 CONTRIBUTING.md | 8 +-
 components/dictionary.txt | 46 ++++++
 scripts/spell-check.R | 7 +-
 template/template_example.Rmd | 4 +-
 43 files changed, 435 insertions(+), 389 deletions(-)

diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index b60d3c2e..8da30ead 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -17,7 +17,7 @@ What things should reviewers look out for?
 ### Formatting Checks
 * [ ] Removed any manual numbering of sections.
 * [ ] Removed any instances of chunk naming.
-* [ ] Spell checked any Rmd file or md file.
+* [ ] Spell checked any `.Rmd` file or md file.
 * [ ] Comments and documentation are up to date.
 * [ ] All links have been checked and are properly formatted.

diff --git a/.github/workflows/style-and-sp-check.yml b/.github/workflows/style-and-sp-check.yml
index 218010c6..fc20d302 100644
--- a/.github/workflows/style-and-sp-check.yml
+++ b/.github/workflows/style-and-sp-check.yml
@@ -45,7 +45,7 @@ jobs:
           result: ${{ steps.spell_check_run.outputs.sp_chk_results }}
         run: |
           cat spell_check_results.tsv
-          if (( $result > 340 )); then
+          if (( $result > 2 )); then
             echo Too many spelling errors, $result
             exit 1
           fi
diff --git a/01-getting-started/getting-started.Rmd b/01-getting-started/getting-started.Rmd
index 92f3d97f..d69bd2e9 100644
--- a/01-getting-started/getting-started.Rmd
+++ b/01-getting-started/getting-started.Rmd
@@ -31,7 +31,7 @@ This tutorial has follow-along examples for use with refine.bio downloads.

 ## About how this tutorial book is structured

-This tutorial contains follow-along analysis examples for refinebio gene expression data.
+This tutorial contains follow-along analysis examples for refine.bio gene expression data.
 The analysis examples are organized by technology: ["microarray"](https://alexslemonade.github.io/refinebio-examples/02-microarray/intro-to-microarray.nb.html) or ["RNA-seq"](https://alexslemonade.github.io/refinebio-examples/03-rna-seq/intro-to-rna-seq.nb.html), in addition to an ["Advanced Topics"](https://alexslemonade.github.io/refinebio-examples/04-advanced-topics/intro-to-advanced-topics.nb.html) section.
 Each analysis is self-contained and provides information with how to obtain the dataset used in the example from refine.bio.
 We encourage you to download the [`.Rmd`](http://rmarkdown.rstudio.com) and follow the "getting started" section in the example before diving into our analysis examples.
@@ -41,7 +41,7 @@ We encourage you to download the [`.Rmd`](http://rmarkdown.rstudio.com) and foll
 - A README that introduces you to the analyses, concepts, requirements, and workflows for that module.
 - An R Notebook which consists of:
   - An R markdown (`.Rmd`) file(s) that you can use in RStudio to run the analysis and contains it's own "getting started" section which describes how to download the example dataset from refine.bio.
-  - An nb.html file that is the resulting output of the Rmd file rendered as an HTML file.
+  - An nb.html file that is the resulting output of the `.Rmd` file rendered as an HTML file.

 ## What you need to install to run the examples

@@ -87,12 +87,12 @@ Saving one of our R Markdowns (the files that end in `.Rmd`) on your computer wi

 See [this guide using to R Notebooks](https://bookdown.org/yihui/rmarkdown/notebook.html#using-notebooks) for more information about inserting and executing code chunks.

-## An important note about file paths and Rmds
+## An important note about file paths and `.Rmd`s

 A `current directory` refers to where R will look for files or otherwise operate.
 Directories are the folders of files on your computer; a file path is the series of folders leading to the file you are referring to.
 R Markdown documents have the `current directory` always set as wherever the `.Rmd` file itself is saved.
-This means all file paths specified in the `.Rmd` must be specified _relative_ to the location of the Rmd.
+This means all file paths specified in the `.Rmd` must be specified _relative_ to the location of the `.Rmd`.
 For more practice with setting file paths in `.Rmd` files see these:

diff --git a/01-getting-started/getting-started.html b/01-getting-started/getting-started.html
index d100ad0e..68f895af 100644
--- a/01-getting-started/getting-started.html
+++ b/01-getting-started/getting-started.html
@@ -1693,7 +1693,7 @@

0.1 About refine.bio

0.2 About how this tutorial book is structured

-

This tutorial contains follow-along analysis examples for refinebio gene expression data. The analysis examples are organized by technology: “microarray” or “RNA-seq”, in addition to an “Advanced Topics” section. Each analysis is self-contained and provides information with how to obtain the dataset used in the example from refine.bio. We encourage you to download the .Rmd and follow the “getting started” section in the example before diving into our analysis examples.

+

This tutorial contains follow-along analysis examples for refine.bio gene expression data. The analysis examples are organized by technology: “microarray” or “RNA-seq”, in addition to an “Advanced Topics” section. Each analysis is self-contained and provides information with how to obtain the dataset used in the example from refine.bio. We encourage you to download the .Rmd and follow the “getting started” section in the example before diving into our analysis examples.

Each analysis contains:

  • A README that introduces you to the analyses, concepts, requirements, and workflows for that module.
    @@ -1702,7 +1702,7 @@

    0.2 About how this tutorial book
    • An R markdown (.Rmd) file(s) that you can use in RStudio to run the analysis and contains it’s own “getting started” section which describes how to download the example dataset from refine.bio.
    • -
    • An nb.html file that is the resulting output of the Rmd file rendered as an HTML file.
    • +
    • An nb.html file that is the resulting output of the .Rmd file rendered as an HTML file.

@@ -1738,9 +1738,9 @@

0.5 How to use R Markdown Documen

R Markdown documents also have the added benefit of producing HTML file output that is nicely rendered and easy to read. Saving one of our R Markdowns (the files that end in .Rmd) on your computer will create an HTML file containing the code and output to be saved alongside it (will end in .nb.html).

See this guide using to R Notebooks for more information about inserting and executing code chunks.

-
-

0.6 An important note about file paths and Rmds

-

A current directory refers to where R will look for files or otherwise operate. Directories are the folders of files on your computer; a file path is the series of folders leading to the file you are referring to. R Markdown documents have the current directory always set as wherever the .Rmd file itself is saved. This means all file paths specified in the .Rmd must be specified relative to the location of the Rmd.

+
+

0.6 An important note about file paths and .Rmds

+

A current directory refers to where R will look for files or otherwise operate. Directories are the folders of files on your computer; a file path is the series of folders leading to the file you are referring to. R Markdown documents have the current directory always set as wherever the .Rmd file itself is saved. This means all file paths specified in the .Rmd must be specified relative to the location of the .Rmd.

For more practice with setting file paths in .Rmd files see these:

  • This handy course chapter from Baumer and Crouser
diff --git a/02-microarray/clustering_microarray_01_heatmap.Rmd b/02-microarray/clustering_microarray_01_heatmap.Rmd
index 7894fdba..57ebb68e 100644
--- a/02-microarray/clustering_microarray_01_heatmap.Rmd
+++ b/02-microarray/clustering_microarray_01_heatmap.Rmd
@@ -18,23 +18,23 @@ This notebook illustrates one way that you can use microarray data from refine.b
 # How to run this example

 For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured).
-We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before.
+We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before.

 ## Obtain the `.Rmd` file

 To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/clustering_microarray_01_heatmap.Rmd).
 You can open this `.Rmd` file in RStudio and follow the rest of these steps from there.
 (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.)
-Clicking this link will most likely send this to your downloads folder on your computer.
+Clicking this link will most likely send this to your downloads folder on your computer.
 Move this `.Rmd` file to where you would like this example and its files to be stored.

-## Set up your analysis folders
+## Set up your analysis folders

 Good file organization is helpful for keeping your data analysis project on track!
-We have set up some code that will automatically set up a folder structure for you.
-Run this next chunk to set up your folders!
+We have set up some code that will automatically set up a folder structure for you.
+Run this next chunk to set up your folders!

-If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations.
+If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations.

 ```{r}
 # Create the data folder if it doesn't exist
@@ -63,7 +63,7 @@ In the same place you put this `.Rmd` file, you should now have three new empty

 ## Obtain the dataset from refine.bio

-For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data).
+For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data).

 Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/GSE24862).

@@ -76,7 +76,7 @@ Fill out the pop up window with your email and our Terms and Conditions:

 It may take a few minutes for the dataset to process.
-You will get an email when it is ready.
+You will get an email when it is ready.

 ## About the dataset we are using for this example

@@ -87,37 +87,37 @@ The samples were obtained from three PLX4032-sensitive parental and three PLX403

 ## Place the dataset in your new `data/` folder

-refine.bio will send you a download button in the email when it is ready.
-Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`.
+refine.bio will send you a download button in the email when it is ready.
+Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`.
 Double clicking should unzip this for you and create a folder of the same name.
-
+
 For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files).

 The `` folder has the data and metadata TSV files you will need for this example analysis.
-Experiment accession ids usually look something like `GSE1235` or `SRP12345`.
+Experiment accession ids usually look something like `GSE1235` or `SRP12345`.
 Copy and paste the `GSE24862` folder into your newly created `data/` folder.

 ## Check out our file structure!

-Your new analysis folder should contain:
+Your new analysis folder should contain:

-- The example analysis Rmd you downloaded
+- The example analysis `.Rmd` you downloaded
 - A folder called "data" which contains:
   - The `GSE24862` folder which contains:
     - The gene expression
     - The metadata TSV
 - A folder for `plots` (currently empty)
 - A folder for `results` (currently empty)
-
-Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):
+
+Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):

-In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis.
-Run this chunk to double check that your files are in the right place.
+In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis.
+Run this chunk to double check that your files are in the right place.

 ```{r}
 # Define the file path to the data directory
@@ -132,13 +132,13 @@ file.exists(file.path(data_dir, "metadata_GSE24862.tsv"))

 If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place.

-If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds).
+If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds).

 # Using a different refine.bio dataset with this analysis?

 If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code).
 We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook.
-From here you can customize this analysis example to fit your own scientific questions and preferences.
+From here you can customize this analysis example to fit your own scientific questions and preferences.

 ***

@@ -196,7 +196,7 @@ Let's take a look at the metadata object that we read into the R environment.
 head(metadata)
 ```

-Now let's ensure that the metadata and data are in the same sample order.
+Now let's ensure that the metadata and data are in the same sample order.

 ```{r}
 # Make the data in the order of the metadata
@@ -211,7 +211,7 @@ Now we are going to use a combination of functions from base R and the `pheatmap
 ## Choose genes of interest

 Although you may want to create a heatmap including all of the genes in the set, alternatively, the heatmap could be created using only genes of interest.
-For this example, we will sort genes by variance, but there are many alternative criterion by which you may want to sort your genes eg fold change, t-statistic, membership to a particular gene ontology, so on.
+For this example, we will sort genes by variance, but there are many alternative criterion by which you may want to sort your genes eg fold change, t-statistic, membership to a particular gene ontology, so on.

 ```{r}
 # Calculate the variance for each gene
@@ -256,7 +256,7 @@ First let's save our clustered heatmap.
 You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.

 ```{r}
-# Open a png file
+# Open a PNG file
 png(file.path(
   plots_dir,
   "GSE24862_heatmap_non_annotated.png" # Replace file name with a relevant output plot name
@@ -265,7 +265,7 @@ png(file.path(
 # Print your heatmap
 heatmap

-# Close the png file:
+# Close the PNG file:
 dev.off()
 ```

@@ -341,7 +341,7 @@ png(file.path(
 # Print your heatmap
 heatmap_annotated

-# Close the png file:
+# Close the PNG file:
 dev.off()
 ```

@@ -352,8 +352,8 @@ dev.off()

 # Print session info

-At the end of every analysis, before saving your notebook, we recommend printing out your session info.
-This helps make your code more reproducible by recording what versions of softwares and packages you used to run this.
+At the end of every analysis, before saving your notebook, we recommend printing out your session info.
+This helps make your code more reproducible by recording what versions of software and packages you used to run this.

 ```{r}
 # Print session info
diff --git a/02-microarray/clustering_microarray_01_heatmap.html b/02-microarray/clustering_microarray_01_heatmap.html
index 4f8cb90b..83f3ca2a 100644
--- a/02-microarray/clustering_microarray_01_heatmap.html
+++ b/02-microarray/clustering_microarray_01_heatmap.html
@@ -1666,7 +1666,7 @@

    Clustering Data - Microarray

    CCDL for ALSF

    -

    September 2020

    +

    October 2020

@@ -1737,7 +1737,7 @@

2.5 Place the dataset in your new

2.6 Check out our file structure!

Your new analysis folder should contain:

    -
  • The example analysis Rmd you downloaded
    +
  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
      @@ -1910,7 +1910,7 @@

      4.4 Create a heatmap

      4.4.1 Save heatmap as a PNG

      You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.

      -
      # Open a png file
      +
      # Open a PNG file
       png(file.path(
         plots_dir,
         "GSE24862_heatmap_non_annotated.png" # Replace file name with a relevant output plot name
      @@ -1919,7 +1919,7 @@ 

      4.4.1 Save heatmap as a PNG

       # Print your heatmap
       heatmap

      -# Close the png file:
      +# Close the PNG file:
       dev.off()
      ## png 
       ##   2
      @@ -1982,7 +1982,7 @@

      4.5.2 Save annotated heatmap as a

       # Print your heatmap
       heatmap_annotated

      -# Close the png file:
      +# Close the PNG file:
       dev.off()

      ## png 
       ##   2
      @@ -1998,7 +1998,7 @@

      5 Further learning resources abou

      @@ -1736,7 +1736,7 @@

      2.5 Place the dataset in your new

      2.6 Check out our file structure!

      Your new analysis folder should contain:

        -
      • The example analysis Rmd you downloaded
        +
      • The example analysis .Rmd you downloaded
      • A folder called “data” which contains:
          @@ -1942,7 +1942,7 @@

          4.5 Check results by plotting one
          ggplot(top_gene_df, aes(x = genotype, y = ENSDARG00000104315, color = genotype)) +
             geom_jitter(width = 0.2) + # We'll make this a jitter plot
             theme_classic() # This makes some aesthetic changes
          -

          +

          These results make sense. The overexpressing CREB group samples have much higher expression values for ENSDARG00000104315 than the control samples do.

@@ -1986,7 +1986,7 @@

4.7 Make a volcano plot

5 Resources for further learning

6 Session info

-

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of softwares and packages you used to run this.

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
 sessionInfo()
## R version 4.0.2 (2020-06-22)
@@ -2022,8 +2022,8 @@ 

6 Session info

 ## loaded via a namespace (and not attached):
 ## [1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2
 ## [4] R.methodsS3_1.8.1 R.utils_2.10.1 tools_4.0.2
-## [7] digest_0.6.25 gtable_0.3.0 evaluate_0.14
-## [10] lifecycle_0.2.0 tibble_3.0.3 R.cache_0.14.0
+## [7] digest_0.6.25 evaluate_0.14 lifecycle_0.2.0
+## [10] tibble_3.0.3 gtable_0.3.0 R.cache_0.14.0
 ## [13] pkgconfig_2.0.3 rlang_0.4.7 cli_2.0.2
 ## [16] rstudioapi_0.11 ggrepel_0.8.2 yaml_2.2.1
 ## [19] xfun_0.17 withr_2.3.0 dplyr_1.0.2
@@ -2032,11 +2032,11 @@

6 Session info

 ## [28] tidyselect_1.1.0 grid_4.0.2 getopt_1.20.3
 ## [31] glue_1.4.2 R6_2.4.1 fansi_0.4.1
 ## [34] rmarkdown_2.3 EnhancedVolcano_1.6.0 farver_2.0.3
-## [37] purrr_0.3.4 readr_1.3.1 rematch2_2.1.2
-## [40] scales_1.1.1 backports_1.1.10 ellipsis_0.3.1
-## [43] htmltools_0.5.0 assertthat_0.2.1 colorspace_1.4-1
-## [46] labeling_0.3 stringi_1.5.3 munsell_0.5.0
-## [49] crayon_1.3.4 R.oo_1.24.0
+## [37] purrr_0.3.4 readr_1.3.1 scales_1.1.1
+## [40] backports_1.1.10 ellipsis_0.3.1 htmltools_0.5.0
+## [43] assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
+## [46] stringi_1.5.3 munsell_0.5.0 crayon_1.3.4
+## [49] R.oo_1.24.0

Blighe K., S. Rana, and M. Lewis, 2020 EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling.

diff --git a/02-microarray/differential-expression_microarray_02_several-groups.Rmd b/02-microarray/differential-expression_microarray_02_several-groups.Rmd
index 3cda5de7..1d9ccf03 100644
--- a/02-microarray/differential-expression_microarray_02_several-groups.Rmd
+++ b/02-microarray/differential-expression_microarray_02_several-groups.Rmd
@@ -103,7 +103,7 @@ Copy and paste the `GSE37418` folder into your newly created `data/` folder.

 Your new analysis folder should contain:

-- The example analysis Rmd you downloaded
+- The example analysis `.Rmd` you downloaded
 - A folder called "data" which contains:
   - The `GSE37418` folder which contains:
     - The gene expression
@@ -511,7 +511,7 @@ ggsave(

 # Resources for further learning

-- [The refinebio example for differential expression for just 2 groups](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_01_2-groups.html)
+- [The refine.bio example for differential expression for just 2 groups](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_01_2-groups.html)
 - [The full users guide on limma](https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf) shows examples of limma functions for different experimental models [@Ritchie2015].
 - [A general guide to differential expression, including a section about interpreting results](http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf) [@Gonzalez2014].
 - [End to End workflow for Affymetrix microarray data](https://www.bioconductor.org/packages/devel/workflows/vignettes/maEndToEnd/inst/doc/MA-Workflow.html) [@Klaus2018].
@@ -521,7 +521,7 @@ ggsave(
 # Session info

 At the end of every analysis, before saving your notebook, we recommend printing out your session info.
-This helps make your code more reproducible by recording what versions of softwares and packages you used to run this.
+This helps make your code more reproducible by recording what versions of software and packages you used to run this.

 ```{r}
 # Print session info
diff --git a/02-microarray/differential-expression_microarray_02_several-groups.html b/02-microarray/differential-expression_microarray_02_several-groups.html
index e40aa064..7ffc212e 100644
--- a/02-microarray/differential-expression_microarray_02_several-groups.html
+++ b/02-microarray/differential-expression_microarray_02_several-groups.html
@@ -1666,7 +1666,7 @@

Differential Expression - Several groups - Microarray

CCDL for ALSF

-

September 2020

+

October 2020

@@ -1736,7 +1736,7 @@

2.5 Place the dataset in your new

2.6 Check out our file structure!

Your new analysis folder should contain:

    -
  • The example analysis Rmd you downloaded
    +
  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
      @@ -1992,7 +1992,7 @@

      4.6 Check results by plotting one
      ggplot(top_gene_df, aes(x = subgroup, y = ENSG00000128683, color = subgroup)) +
         geom_jitter(width = 0.2) + # We'll make this a jitter plot
         theme_classic() # This makes some aesthetic changes
      -

      +

      Yes! These results make sense. The WNT samples have much higher expression of ENSG00000128683 than the other samples.

@@ -2103,7 +2103,7 @@

4.8 Make volcano plots

5 Resources for further learning

6 Session info

-

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of softwares and packages you used to run this.

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
 sessionInfo()
## R version 4.0.2 (2020-06-22)
@@ -2139,18 +2139,18 @@ 

6 Session info

 ##
 ## loaded via a namespace (and not attached):
 ## [1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2 R.methodsS3_1.8.1
-## [5] R.utils_2.10.1 tools_4.0.2 digest_0.6.25 gtable_0.3.0
-## [9] evaluate_0.14 lifecycle_0.2.0 tibble_3.0.3 R.cache_0.14.0
+## [5] R.utils_2.10.1 tools_4.0.2 digest_0.6.25 evaluate_0.14
+## [9] lifecycle_0.2.0 tibble_3.0.3 gtable_0.3.0 R.cache_0.14.0
 ## [13] pkgconfig_2.0.3 rlang_0.4.7 cli_2.0.2 rstudioapi_0.11
 ## [17] yaml_2.2.1 xfun_0.17 withr_2.3.0 dplyr_1.0.2
 ## [21] styler_1.3.2 stringr_1.4.0 knitr_1.30 generics_0.0.2
 ## [25] vctrs_0.3.4 hms_0.5.3 tidyselect_1.1.0 grid_4.0.2
 ## [29] getopt_1.20.3 glue_1.4.2 R6_2.4.1 fansi_0.4.1
 ## [33] rmarkdown_2.3 tidyr_1.1.2 farver_2.0.3 purrr_0.3.4
-## [37] readr_1.3.1 rematch2_2.1.2 scales_1.1.1 backports_1.1.10
-## [41] ellipsis_0.3.1 htmltools_0.5.0 assertthat_0.2.1 colorspace_1.4-1
-## [45] labeling_0.3 utf8_1.1.4 stringi_1.5.3 munsell_0.5.0
-## [49] crayon_1.3.4 R.oo_1.24.0
+## [37] readr_1.3.1 scales_1.1.1 backports_1.1.10 ellipsis_0.3.1
+## [41] htmltools_0.5.0 assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
+## [45] utf8_1.1.4 stringi_1.5.3 munsell_0.5.0 crayon_1.3.4
+## [49] R.oo_1.24.0

Gonzalez I., 2014 Statistical analysis of rna-seq data.

diff --git a/02-microarray/dimension-reduction_microarray_01_pca.Rmd b/02-microarray/dimension-reduction_microarray_01_pca.Rmd
index f685e1e1..95518e30 100644
--- a/02-microarray/dimension-reduction_microarray_01_pca.Rmd
+++ b/02-microarray/dimension-reduction_microarray_01_pca.Rmd
@@ -107,7 +107,7 @@ Copy and paste the `GSE37382` folder into your newly created `data/` folder.

 Your new analysis folder should contain:

-- The example analysis Rmd you downloaded
+- The example analysis `.Rmd` you downloaded
 - A folder called "data" which contains:
   - The `GSE37382` folder which contains:
     - The gene expression
@@ -334,7 +334,7 @@ plot = pca_plot # Here we are giving the function the plot object that we want s
 # Session info

 At the end of every analysis, before saving your notebook, we recommend printing out your session info.
-This helps make your code more reproducible by recording what versions of softwares and packages you used to run this.
+This helps make your code more reproducible by recording what versions of software and packages you used to run this.

 ```{r}
 # Print session info
diff --git a/02-microarray/dimension-reduction_microarray_01_pca.html b/02-microarray/dimension-reduction_microarray_01_pca.html
index dd23a554..544a0cd2 100644
--- a/02-microarray/dimension-reduction_microarray_01_pca.html
+++ b/02-microarray/dimension-reduction_microarray_01_pca.html
@@ -1666,7 +1666,7 @@

PCA Visualization - Microarray

CCDL for ALSF

-

September 2020

+

October 2020

@@ -1737,7 +1737,7 @@

2.5 Place the dataset in your new

2.6 Check out our file structure!

Your new analysis folder should contain:

    -
  • The example analysis Rmd you downloaded
    +
  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
      @@ -2285,7 +2285,7 @@

      5 Resources for further learning

6 Session info

-

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of softwares and packages you used to run this.

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
 sessionInfo()
## R version 4.0.2 (2020-06-22)
@@ -2319,9 +2319,9 @@ 

6 Session info

## [25] vctrs_0.3.4 hms_0.5.3 tidyselect_1.1.0 grid_4.0.2 ## [29] getopt_1.20.3 glue_1.4.2 R6_2.4.1 fansi_0.4.1 ## [33] rmarkdown_2.3 farver_2.0.3 purrr_0.3.4 readr_1.3.1 -## [37] rematch2_2.1.2 scales_1.1.1 backports_1.1.10 ellipsis_0.3.1 -## [41] htmltools_0.5.0 assertthat_0.2.1 colorspace_1.4-1 labeling_0.3 -## [45] stringi_1.5.3 munsell_0.5.0 crayon_1.3.4 R.oo_1.24.0
+## [37] backports_1.1.10 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0 +## [41] assertthat_0.2.1 colorspace_1.4-1 labeling_0.3 stringi_1.5.3 +## [45] munsell_0.5.0 crayon_1.3.4 R.oo_1.24.0

Brems M., 2017 A one-stop shop for principal component analysis

diff --git a/02-microarray/dimension-reduction_microarray_02_umap.Rmd b/02-microarray/dimension-reduction_microarray_02_umap.Rmd index f4897272..86a26000 100644 --- a/02-microarray/dimension-reduction_microarray_02_umap.Rmd +++ b/02-microarray/dimension-reduction_microarray_02_umap.Rmd @@ -106,7 +106,7 @@ Copy and paste the `GSE37382` folder into your newly created `data/` folder. Your new analysis folder should contain: -- The example analysis Rmd you downloaded +- The example analysis `.Rmd` you downloaded - A folder called "data" which contains: - The `GSE37382` folder which contains: - The gene expression @@ -332,7 +332,7 @@ plot = umap_plot # Here we are giving the function the plot object that we want # Session info At the end of every analysis, before saving your notebook, we recommend printing out your session info. -This helps make your code more reproducible by recording what versions of softwares and packages you used to run this. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. ```{r} # Print session info diff --git a/02-microarray/dimension-reduction_microarray_02_umap.html b/02-microarray/dimension-reduction_microarray_02_umap.html index 8673961c..a626168e 100644 --- a/02-microarray/dimension-reduction_microarray_02_umap.html +++ b/02-microarray/dimension-reduction_microarray_02_umap.html @@ -1666,7 +1666,7 @@

UMAP Visualization - Microarray

CCDL for ALSF

-

September 2020

+

October 2020

@@ -1737,7 +1737,7 @@

2.5 Place the dataset in your new

2.6 Check out our file structure!

Your new analysis folder should contain:

    -
  • The example analysis Rmd you downloaded
    +
  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
      @@ -1953,7 +1953,7 @@

      5 Resources for further learning

6 Session info

-

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of softwares and packages you used to run this.

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
 sessionInfo()
## R version 4.0.2 (2020-06-22)
@@ -1978,20 +1978,20 @@ 

6 Session info

## [1] magrittr_1.5 ggplot2_3.3.2 umap_0.2.6.0 optparse_1.6.6 ## ## loaded via a namespace (and not attached): -## [1] reticulate_1.16 styler_1.3.2 tidyselect_1.1.0 xfun_0.17 -## [5] rematch2_2.1.2 purrr_0.3.4 lattice_0.20-41 colorspace_1.4-1 -## [9] vctrs_0.3.4 generics_0.0.2 htmltools_0.5.0 getopt_1.20.3 -## [13] yaml_2.2.1 rlang_0.4.7 R.oo_1.24.0 pillar_1.4.6 -## [17] glue_1.4.2 withr_2.3.0 R.utils_2.10.1 R.cache_0.14.0 -## [21] lifecycle_0.2.0 stringr_1.4.0 munsell_0.5.0 gtable_0.3.0 -## [25] R.methodsS3_1.8.1 evaluate_0.14 labeling_0.3 knitr_1.30 -## [29] fansi_0.4.1 Rcpp_1.0.5 readr_1.3.1 openssl_1.4.3 -## [33] backports_1.1.10 scales_1.1.1 jsonlite_1.7.1 farver_2.0.3 -## [37] RSpectra_0.16-0 hms_0.5.3 askpass_1.1 digest_0.6.25 -## [41] stringi_1.5.3 dplyr_1.0.2 grid_4.0.2 cli_2.0.2 -## [45] tools_4.0.2 tibble_3.0.3 crayon_1.3.4 pkgconfig_2.0.3 -## [49] ellipsis_0.3.1 Matrix_1.2-18 assertthat_0.2.1 rmarkdown_2.3 -## [53] rstudioapi_0.11 R6_2.4.1 compiler_4.0.2
+## [1] Rcpp_1.0.5 RSpectra_0.16-0 pillar_1.4.6 compiler_4.0.2 +## [5] R.methodsS3_1.8.1 R.utils_2.10.1 tools_4.0.2 digest_0.6.25 +## [9] gtable_0.3.0 jsonlite_1.7.1 lattice_0.20-41 evaluate_0.14 +## [13] lifecycle_0.2.0 tibble_3.0.3 R.cache_0.14.0 pkgconfig_2.0.3 +## [17] rlang_0.4.7 Matrix_1.2-18 cli_2.0.2 rstudioapi_0.11 +## [21] yaml_2.2.1 xfun_0.17 withr_2.3.0 dplyr_1.0.2 +## [25] styler_1.3.2 stringr_1.4.0 knitr_1.30 generics_0.0.2 +## [29] askpass_1.1 vctrs_0.3.4 hms_0.5.3 tidyselect_1.1.0 +## [33] grid_4.0.2 getopt_1.20.3 reticulate_1.16 glue_1.4.2 +## [37] R6_2.4.1 fansi_0.4.1 rmarkdown_2.3 farver_2.0.3 +## [41] purrr_0.3.4 readr_1.3.1 scales_1.1.1 backports_1.1.10 +## [45] ellipsis_0.3.1 htmltools_0.5.0 assertthat_0.2.1 colorspace_1.4-1 +## [49] labeling_0.3 stringi_1.5.3 munsell_0.5.0 openssl_1.4.3 +## [53] crayon_1.3.4 R.oo_1.24.0

Konopka T., 2020 Uniform manifold approximation and projection.

diff --git a/02-microarray/gene-id-annotation_microarray_01_ensembl.Rmd b/02-microarray/gene-id-annotation_microarray_01_ensembl.Rmd index a58183b6..546f1efc 100644 --- a/02-microarray/gene-id-annotation_microarray_01_ensembl.Rmd +++ b/02-microarray/gene-id-annotation_microarray_01_ensembl.Rmd @@ -275,11 +275,11 @@ Let's get a summary of the gene symbols returned in the `Symbol` column of our m summary(mapped_df$Symbol) ``` -There are 998 NA's in our data frame, which means that 998 out of the 17918 Ensembl IDs did not map to gene symbols. +There are 998 NAs in our data frame, which means that 998 out of the 17918 Ensembl IDs did not map to gene symbols. 998 out of 17918 is not too bad a rate, in our opinion, but note that different gene identifier types will have different mapping rates and that is to be expected. Regardless, it is always good to be aware of how many genes you are potentially "losing" if you rely on this new gene identifier you've mapped to for downstream analyses. -However, if you have almost all NA's it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible. +However, if you have almost all NAs it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible. Now let's check to see if we have any genes that were mapped to multiple symbols. @@ -346,7 +346,7 @@ readr::write_tsv(filtered_mapped_df, file.path( # Session info At the end of every analysis, before saving your notebook, we recommend printing out your session info. -This helps make your code more reproducible by recording what versions of softwares and packages you used to run this. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. 
```{r} # Print session info diff --git a/02-microarray/gene-id-annotation_microarray_01_ensembl.html b/02-microarray/gene-id-annotation_microarray_01_ensembl.html index 1c7f1e51..25bf00dd 100644 --- a/02-microarray/gene-id-annotation_microarray_01_ensembl.html +++ b/02-microarray/gene-id-annotation_microarray_01_ensembl.html @@ -1666,7 +1666,7 @@

Obtaining Annotation for Ensembl IDs - Microarray

CCDL for ALSF

-

September 2020

+

October 2020

@@ -1959,8 +1959,8 @@

4.4 Explore gene ID conversion
##    Length     Class      Mode 
 ##     17977 character character
-

There are 998 NA’s in our data frame, which means that 998 out of the 17918 Ensembl IDs did not map to gene symbols. 998 out of 17918 is not too bad a rate, in our opinion, but note that different gene identifier types will have different mapping rates and that is to be expected. Regardless, it is always good to be aware of how many genes you are potentially “losing” if you rely on this new gene identifier you’ve mapped to for downstream analyses.

-

However, if you have almost all NA’s it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible.

+

There are 998 NAs in our data frame, which means that 998 out of the 17918 Ensembl IDs did not map to gene symbols. 998 out of 17918 is not too bad a rate, in our opinion, but note that different gene identifier types will have different mapping rates and that is to be expected. Regardless, it is always good to be aware of how many genes you are potentially “losing” if you rely on this new gene identifier you’ve mapped to for downstream analyses.

+

However, if you have almost all NAs it is possible that the function was executed incorrectly or you may want to consider using a different gene identifier, if possible.

Now let’s check to see if we have any genes that were mapped to multiple symbols.

multi_mapped <- mapped_df %>%
   # Let's group by the Ensembl IDs in the `Ensembl` column
@@ -2024,7 +2024,7 @@ 

5 Resources for further learning

6 Session info

-

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of softwares and packages you used to run this.

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
 sessionInfo()
## R version 4.0.2 (2020-06-22)
@@ -2061,9 +2061,9 @@ 

6 Session info

## [25] knitr_1.30 generics_0.0.2 vctrs_0.3.4 hms_0.5.3 ## [29] tidyselect_1.1.0 bit64_4.0.5 getopt_1.20.3 glue_1.4.2 ## [33] R6_2.4.1 fansi_0.4.1 rmarkdown_2.3 tidyr_1.1.2 -## [37] blob_1.2.1 purrr_0.3.4 readr_1.3.1 rematch2_2.1.2 -## [41] backports_1.1.10 ellipsis_0.3.1 htmltools_0.5.0 assertthat_0.2.1 -## [45] utf8_1.1.4 stringi_1.5.3 crayon_1.3.4 R.oo_1.24.0
+## [37] blob_1.2.1 purrr_0.3.4 readr_1.3.1 backports_1.1.10 +## [41] ellipsis_0.3.1 htmltools_0.5.0 assertthat_0.2.1 utf8_1.1.4 +## [45] stringi_1.5.3 crayon_1.3.4 R.oo_1.24.0

Carlson M., 2019 Genome wide annotation for mouse

diff --git a/02-microarray/ortholog_mapping_microarray_01.Rmd b/02-microarray/ortholog_mapping_microarray_01.Rmd index b30ec003..551ceff5 100644 --- a/02-microarray/ortholog_mapping_microarray_01.Rmd +++ b/02-microarray/ortholog_mapping_microarray_01.Rmd @@ -2,23 +2,23 @@ title: "Ortholog Mapping with `hcop`" author: ALSF CCDL - Jaclyn Taroni output: - html_notebook: + html_notebook: toc: true toc_float: true --- -*Purpose*: This notebook demonstrates how you can use the -[`hcop`](https://github.com/stephenturner/hcop) package to perform ortholog -mapping for data obtained from refine.bio. +*Purpose*: This notebook demonstrates how you can use the +[`hcop`](https://github.com/stephenturner/hcop) package to perform ortholog +mapping for data obtained from refine.bio. HCOP stands for HGNC Comparison of Orthology Predictions. -You can read more about the package +You can read more about the package [here](https://stephenturner.github.io/hcop). ## 1) Install `hcop` -We'll need to use the -[`devtools`](https://cran.r-project.org/web/packages/devtools/index.html) -package to install `hcop` from Github. +We'll need to use the +[`devtools`](https://cran.r-project.org/web/packages/devtools/index.html) +package to install `hcop` from GitHub. `devtools` can be installed using the instructions below. Note that this will first check if `devtools` is installed and install it if it is not. @@ -31,15 +31,15 @@ if (!("devtools" %in% installed.packages())) { _Note:_ `devtools` requires `git2r` which we've encountered trouble installing before. -Specifically, `git2r` requires the [`libgit2`](https://libgit2.org/) and +Specifically, `git2r` requires the [`libgit2`](https://libgit2.org/) and [`zlib`](https://zlib.net/) system libraries. 
-If you're using a Docker image from the -[Rocker project](https://www.rocker-project.org/) without these dependencies -(e.g., `rocker/rstudio`), follow +If you're using a Docker image from the +[Rocker project](https://www.rocker-project.org/) without these dependencies +(e.g., `rocker/rstudio`), follow [these instructions](https://github.com/rocker-org/rocker/wiki/Using-the-RStudio-image#dependencies-external-to-the-r-system). -Now we'll install `hcop` from Github. -You can control what version of the package is installed by using the `ref` +Now we'll install `hcop` from GitHub. +You can control what version of the package is installed by using the `ref` command of `devtools::install_github`. Here, we'll use the most recent commit at the time that we are putting together this example. @@ -64,11 +64,11 @@ if (!dir.exists(results_dir)) { ## 2) Mapping zebrafish Ensembl gene IDs to human symbols -`hcop` supports mapping between zebrafish and human identifiers. +`hcop` supports mapping between zebrafish and human identifiers. Here, we'll use zebrafish data from refine.bio and annotate it with human gene symbols. -In keeping with the `hcop` documentation, we'll use tidyverse packages -(e.g., [`dplyr`](https://dplyr.tidyverse.org/) and +In keeping with the `hcop` documentation, we'll use tidyverse packages +(e.g., [`dplyr`](https://dplyr.tidyverse.org/) and [`readr`](https://readr.tidyverse.org)) below. ```{r Load hcop and dplyr libraries} @@ -76,7 +76,7 @@ library(hcop) library(dplyr) ``` -Let's read in the tsv file from refine.bio. +Let's read in the TSV file from refine.bio. We'll convert _all_ identifiers in the file, rather than selecting a few. Because of the formatting of the output of refine.bio, the first column name will be filled in with `'X1'`. 
@@ -88,7 +88,7 @@ exprs.df <- readr::read_tsv(file.path( )) ``` -refine.bio data uses Ensembl gene identifiers, which will be in the first +refine.bio data uses Ensembl gene identifiers, which will be in the first column. ```{r Examine identifiers head} @@ -121,7 +121,7 @@ Here's what the new data.frame looks like: head(human.symbol.df, 25) ``` -## 3) Write newly annotated data to a tsv file +## 3) Write newly annotated data to a TSV file ```{r Write to file} readr::write_tsv( @@ -138,8 +138,8 @@ readr::write_tsv( * Multiple zebrafish Ensembl gene IDs map to the same human symbol which results in duplicated human gene symbols. Some downstream tools may need this to be resolved. -* If a zebrafish Ensembl gene ID maps to multiple human symbols, the gene -expression values get duplicated. Let's look at the `ENSDARG00000069142` +* If a zebrafish Ensembl gene ID maps to multiple human symbols, the gene +expression values get duplicated. Let's look at the `ENSDARG00000069142` example below. ```{r} diff --git a/02-microarray/pathway_analysis_microarray_00_intro.Rmd b/02-microarray/pathway_analysis_microarray_00_intro.Rmd index c17a6b5a..dde61bff 100644 --- a/02-microarray/pathway_analysis_microarray_00_intro.Rmd +++ b/02-microarray/pathway_analysis_microarray_00_intro.Rmd @@ -1,6 +1,6 @@ --- title: "Over-representation analysis with WebGestaltR" -output: +output: html_notebook: toc: TRUE toc_float: TRUE @@ -15,25 +15,25 @@ where one can ask if a set of genes (e.g., those differentially expressed using some cutoff) shares more or less genes with gene sets/pathways than we would expect at random. The other methodologies introduced throughout this module such as QuSAGE and -GSEA can require more samples than a different expression analysis. +GSEA can require more samples than a different expression analysis. For instance, the sample label permutation step of GSEA is reported to -perform poorly with 7 samples or less in each group -([Yaari et al. _NAR_. 
2013.](https://doi.org/10.1093/nar/gkt660)). -It is not uncommon to have n ~ 3 for each group in a treatment-control +perform poorly with 7 samples or less in each group +([](https://doi.org/10.1093/nar/gkt660)). +It is not uncommon to have n ~ 3 for each group in a treatment-control transcriptomic study, at which point identifying differentially expressed genes is possible. If you are interested in performing pathway analysis on a small study, ORA may be your best bet. -There are some limitations to ORA methods to be aware such as ignoring +There are some limitations to ORA methods to be aware such as ignoring gene-gene correlation. -See [Khatri et al. _PLoS Comp Bio._ 2012.](https://doi.org/10.1371/journal.pcbi.1002375) -to learn more about the different types of pathway analysis and their +See [](https://doi.org/10.1371/journal.pcbi.1002375) +to learn more about the different types of pathway analysis and their limitations. ## Data In this example, we will use a table of differential expression analysis results -from another one of the example modules +from another one of the example modules ([`validate-differential-expression`](https://github.com/AlexsLemonade/refinebio-examples/tree/master/validate-differential-expression)). Genes were tested for differential expression between the SHH and Groups 3 and 4 subgroups of medulloblastoma in refine.bio-processed data from @@ -76,7 +76,7 @@ if (!dir.exists(results_dir)) { ### Differentially expressed genes from Robinson et al. We will read in the differential expression results from GitHub. -These results are from a two group comparison using +These results are from a two group comparison using [`limma`](https://bioconductor.org/packages/release/bioc/html/limma.html). The table contains Ensembl gene IDs, log fold-change, and adjusted p-values (FDR in this case). 
@@ -91,7 +91,7 @@ dge_url <- "https://github.com/AlexsLemonade/refinebio-examples/raw/10b116dff0d4 dge_df <- readr::read_tsv(dge_url) ``` -Here we'll use log fold-change > 2 and FDR < 0.05 as cutoffs for determining +Here we'll use log fold-change > 2 and FDR < 0.05 as cutoffs for determining what genes are of interest. ```{r} @@ -100,10 +100,10 @@ upregulated_genes <- dge_df %>% dplyr::pull(ENSEMBL) ``` -Because we are testing if there is more overlap between a set of genes of -interest and gene sets or pathways from some knowledgebase (e.g., KEGG, Gene -Ontology (GO)) than we would expect at random, we need to identify an -appropriate background set. +Because we are testing if there is more overlap between a set of genes of +interest and gene sets or pathways from some knowledge base (e.g., KEGG, Gene +Ontology (GO)) than we would expect at random, we need to identify an +appropriate background set. Put another way, if a gene is _not measured_, it can not possibly be in our gene set of interest. We can provide our analysis method of choice with a reference list, which @@ -119,7 +119,7 @@ all_genes <- dge_df %>% We can check whether or not we need to convert to a different gene identifier by figuring out what gene identifiers `WebGestaltR` accepts for human. -We can do this with the `listIdType` function; the first argument is the +We can do this with the `listIdType` function; the first argument is the organism name. ```{r} @@ -130,11 +130,11 @@ We can see that `"ensembl_gene_id"` is a compatible identifier, and therefore we do not need to convert to a different identifier. The `WebGestaltR` function is a wrapper for the [WEB-based GEne SeT -AnaLysis Toolkit (WebGestalt)](http://www.webgestalt.org/) -([Wang et al. _NAR_. 2017.](https://doi.org/10.1093/nar/gkx356); +AnaLysis Toolkit (WebGestalt)](http://www.webgestalt.org/) +([Wang et al. . 2017.](https://doi.org/10.1093/nar/gkx356); note that WebGestalt has a new 2019 version). 
-WebGestalt can perform multiple _types_ of pathway analysis. +WebGestalt can perform multiple _types_ of pathway analysis. Here we're using it for ORA and we will use [Gene Ontology (GO)](http://geneontology.org/docs/ontology-documentation/) biological processes as our source of gene sets. We can see all supported gene sets for humans with `listGeneSet`. @@ -144,7 +144,7 @@ listGeneSet("hsapiens") ``` `WebGestaltR` will generate an HTML file with a report (path specified -by the `outputDirectory` and `projectName` arguments) _and_ return a +by the `outputDirectory` and `projectName` arguments) _and_ return a `data.frame`. ```{r} @@ -164,15 +164,15 @@ go_enrichment_results <- As noted in the messages from `WebGestaltR`, the results are in `results/Project_GSE37418_up_SHH_lfc2_fdr0_05`. -The HTML report is at +The HTML report is at [`results/Project_GSE37418_up_SHH_lfc2_fdr0_05/Report_GSE37418_up_SHH_lfc2_fdr0_05.html`](./results/Project_GSE37418_up_SHH_lfc2_fdr0_05/Report_GSE37418_up_SHH_lfc2_fdr0_05.html). It looks like there are a lot of pathways associated with ribosomes and translation. Gene sets, particularly GO gene sets that are of a hierarchical nature, are -not independent, so it's important to keep in mind that many of these gene sets +not independent, so it's important to keep in mind that many of these gene sets could be telling us the same thing. -For more information about WebGestalt output or advanced options, +For more information about WebGestalt output or advanced options, please see the [WebGestalt 2019 Manual](http://www.webgestalt.org/WebGestalt_2019_Manual.pdf) and the [`WebGestaltR` package documentation](https://www.rdocumentation.org/packages/WebGestaltR/versions/0.3.0). 
diff --git a/02-microarray/pathway_analysis_microarray_01_ortholog_mapping_kegg.Rmd b/02-microarray/pathway_analysis_microarray_01_ortholog_mapping_kegg.Rmd index 8cfd6333..2960f834 100644 --- a/02-microarray/pathway_analysis_microarray_01_ortholog_mapping_kegg.Rmd +++ b/02-microarray/pathway_analysis_microarray_01_ortholog_mapping_kegg.Rmd @@ -10,7 +10,7 @@ date: 2019 ## Background -In this module, we use QuSAGE ([Yaari et al. _NAR_. 2013.](https://doi.org/10.1093/nar/gkt660)) +In this module, we use QuSAGE ([](https://doi.org/10.1093/nar/gkt660)) for pathway analysis (implemented in the [`qusage` bioconductor package](https://bioconductor.org/packages/release/bioc/html/qusage.html)). `qusage` allows you to read in gene sets that are in the [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). diff --git a/02-microarray/pathway_analysis_microarray_02_ora_with_webgestaltr.Rmd b/02-microarray/pathway_analysis_microarray_02_ora_with_webgestaltr.Rmd index c17a6b5a..4313bb93 100644 --- a/02-microarray/pathway_analysis_microarray_02_ora_with_webgestaltr.Rmd +++ b/02-microarray/pathway_analysis_microarray_02_ora_with_webgestaltr.Rmd @@ -1,6 +1,6 @@ --- title: "Over-representation analysis with WebGestaltR" -output: +output: html_notebook: toc: TRUE toc_float: TRUE @@ -15,25 +15,25 @@ where one can ask if a set of genes (e.g., those differentially expressed using some cutoff) shares more or less genes with gene sets/pathways than we would expect at random. The other methodologies introduced throughout this module such as QuSAGE and -GSEA can require more samples than a different expression analysis. +GSEA can require more samples than a different expression analysis. For instance, the sample label permutation step of GSEA is reported to -perform poorly with 7 samples or less in each group -([Yaari et al. _NAR_. 2013.](https://doi.org/10.1093/nar/gkt660)). 
-It is not uncommon to have n ~ 3 for each group in a treatment-control +perform poorly with 7 samples or less in each group +([](https://doi.org/10.1093/nar/gkt660)). +It is not uncommon to have n ~ 3 for each group in a treatment-control transcriptomic study, at which point identifying differentially expressed genes is possible. If you are interested in performing pathway analysis on a small study, ORA may be your best bet. -There are some limitations to ORA methods to be aware such as ignoring +There are some limitations to ORA methods to be aware such as ignoring gene-gene correlation. -See [Khatri et al. _PLoS Comp Bio._ 2012.](https://doi.org/10.1371/journal.pcbi.1002375) -to learn more about the different types of pathway analysis and their +See [ et al. _PLoS Comp Bio._ 2012.](https://doi.org/10.1371/journal.pcbi.1002375) +to learn more about the different types of pathway analysis and their limitations. ## Data In this example, we will use a table of differential expression analysis results -from another one of the example modules +from another one of the example modules ([`validate-differential-expression`](https://github.com/AlexsLemonade/refinebio-examples/tree/master/validate-differential-expression)). Genes were tested for differential expression between the SHH and Groups 3 and 4 subgroups of medulloblastoma in refine.bio-processed data from @@ -76,7 +76,7 @@ if (!dir.exists(results_dir)) { ### Differentially expressed genes from Robinson et al. We will read in the differential expression results from GitHub. -These results are from a two group comparison using +These results are from a two group comparison using [`limma`](https://bioconductor.org/packages/release/bioc/html/limma.html). The table contains Ensembl gene IDs, log fold-change, and adjusted p-values (FDR in this case). 
@@ -91,7 +91,7 @@ dge_url <- "https://github.com/AlexsLemonade/refinebio-examples/raw/10b116dff0d4 dge_df <- readr::read_tsv(dge_url) ``` -Here we'll use log fold-change > 2 and FDR < 0.05 as cutoffs for determining +Here we'll use log fold-change > 2 and FDR < 0.05 as cutoffs for determining what genes are of interest. ```{r} @@ -100,10 +100,10 @@ upregulated_genes <- dge_df %>% dplyr::pull(ENSEMBL) ``` -Because we are testing if there is more overlap between a set of genes of -interest and gene sets or pathways from some knowledgebase (e.g., KEGG, Gene -Ontology (GO)) than we would expect at random, we need to identify an -appropriate background set. +Because we are testing if there is more overlap between a set of genes of +interest and gene sets or pathways from some knowledge base (e.g., KEGG, Gene +Ontology (GO)) than we would expect at random, we need to identify an +appropriate background set. Put another way, if a gene is _not measured_, it can not possibly be in our gene set of interest. We can provide our analysis method of choice with a reference list, which @@ -119,7 +119,7 @@ all_genes <- dge_df %>% We can check whether or not we need to convert to a different gene identifier by figuring out what gene identifiers `WebGestaltR` accepts for human. -We can do this with the `listIdType` function; the first argument is the +We can do this with the `listIdType` function; the first argument is the organism name. ```{r} @@ -130,11 +130,11 @@ We can see that `"ensembl_gene_id"` is a compatible identifier, and therefore we do not need to convert to a different identifier. The `WebGestaltR` function is a wrapper for the [WEB-based GEne SeT -AnaLysis Toolkit (WebGestalt)](http://www.webgestalt.org/) -([Wang et al. _NAR_. 2017.](https://doi.org/10.1093/nar/gkx356); +AnaLysis Toolkit (WebGestalt)](http://www.webgestalt.org/) +([Wang et al. . 2017.](https://doi.org/10.1093/nar/gkx356); note that WebGestalt has a new 2019 version). 
-WebGestalt can perform multiple _types_ of pathway analysis. +WebGestalt can perform multiple _types_ of pathway analysis. Here we're using it for ORA and we will use [Gene Ontology (GO)](http://geneontology.org/docs/ontology-documentation/) biological processes as our source of gene sets. We can see all supported gene sets for humans with `listGeneSet`. @@ -144,7 +144,7 @@ listGeneSet("hsapiens") ``` `WebGestaltR` will generate an HTML file with a report (path specified -by the `outputDirectory` and `projectName` arguments) _and_ return a +by the `outputDirectory` and `projectName` arguments) _and_ return a `data.frame`. ```{r} @@ -164,15 +164,15 @@ go_enrichment_results <- As noted in the messages from `WebGestaltR`, the results are in `results/Project_GSE37418_up_SHH_lfc2_fdr0_05`. -The HTML report is at +The HTML report is at [`results/Project_GSE37418_up_SHH_lfc2_fdr0_05/Report_GSE37418_up_SHH_lfc2_fdr0_05.html`](./results/Project_GSE37418_up_SHH_lfc2_fdr0_05/Report_GSE37418_up_SHH_lfc2_fdr0_05.html). It looks like there are a lot of pathways associated with ribosomes and translation. Gene sets, particularly GO gene sets that are of a hierarchical nature, are -not independent, so it's important to keep in mind that many of these gene sets +not independent, so it's important to keep in mind that many of these gene sets could be telling us the same thing. -For more information about WebGestalt output or advanced options, +For more information about WebGestalt output or advanced options, please see the [WebGestalt 2019 Manual](http://www.webgestalt.org/WebGestalt_2019_Manual.pdf) and the [`WebGestaltR` package documentation](https://www.rdocumentation.org/packages/WebGestaltR/versions/0.3.0). 
diff --git a/02-microarray/pathway_analysis_microarray_03_qusage_meta_analysis.Rmd b/02-microarray/pathway_analysis_microarray_03_qusage_meta_analysis.Rmd index bfc9f7bc..dd62f890 100644 --- a/02-microarray/pathway_analysis_microarray_03_qusage_meta_analysis.Rmd +++ b/02-microarray/pathway_analysis_microarray_03_qusage_meta_analysis.Rmd @@ -11,7 +11,7 @@ date: 2019 ## Background The Quantitative Set Analysis of Gene Expression (QuSAGE) -([Yaari et al. _NAR_. 2013.](https://doi.org/10.1093/nar/gkt660)) framework +([](https://doi.org/10.1093/nar/gkt660)) framework has advantages that we outline in [`qusage_single_dataset`](./qusage_single_dataset.nb.html), including the fact that it returns more than just a p-value. @@ -21,7 +21,7 @@ If we're interested in pathway analysis of multiple datasets, QuSAGE allows us to perform a _meta-analysis_ by combining distributions from the QuSAGE results from each dataset. Meta-analysis with QuSAGE is described in -[Meng et al. _PLoS Comp Bio._ 2019.](https://doi.org/10.1371/journal.pcbi.1006899) +[ et al. _PLoS Comp Bio._ 2019.](https://doi.org/10.1371/journal.pcbi.1006899) and implemented in the [`qusage` bioconductor package](https://bioconductor.org/packages/release/bioc/html/qusage.html). The [`qusage` vignette](https://bioconductor.org/packages/release/bioc/vignettes/qusage/inst/doc/qusage.pdf) contains a section on meta-analysis. @@ -472,7 +472,7 @@ We can look into what genes are driving the pathway activity with the `plotCIsGenes` function. Gene activity, which will be plotted on the y-axis, is difference between the two groups. -The _pathway_ CI will also be displayed on the plot as a grey band by default. +The _pathway_ CI will also be displayed on the plot as a gray band by default. 
```{r} plotCIsGenes(northcott_results, diff --git a/02-microarray/pathway_analysis_microarray_04_qusage_replicate_vignette.Rmd b/02-microarray/pathway_analysis_microarray_04_qusage_replicate_vignette.Rmd index 7edc2154..54a93799 100644 --- a/02-microarray/pathway_analysis_microarray_04_qusage_replicate_vignette.Rmd +++ b/02-microarray/pathway_analysis_microarray_04_qusage_replicate_vignette.Rmd @@ -10,7 +10,7 @@ date: 2019 ## Background -Here, we will replicate the [`qusage` package vignette](https://bioconductor.org/packages/release/bioc/vignettes/qusage/inst/doc/qusage.pdf) (Bolen C.). +Here, we will replicate the [`qusage` package vignette](https://bioconductor.org/packages/release/bioc/vignettes/qusage/inst/doc/qusage.pdf) ( C.). Specifically, we'll use the same dataset and analysis as the vignette, but the expression data and sample metadata we will use is processed with refine.bio. This allows us to explore formatting refine.bio datasets for use with diff --git a/02-microarray/pathway_analysis_microarray_05_qusage_single_dataset.Rmd b/02-microarray/pathway_analysis_microarray_05_qusage_single_dataset.Rmd index ee80ea24..27b903e3 100644 --- a/02-microarray/pathway_analysis_microarray_05_qusage_single_dataset.Rmd +++ b/02-microarray/pathway_analysis_microarray_05_qusage_single_dataset.Rmd @@ -1,6 +1,6 @@ --- title: "Pathway analysis with QuSAGE: Single dataset" -output: +output: html_notebook: toc: TRUE toc_float: TRUE @@ -11,42 +11,42 @@ date: 2019 ## Background In this module, we'll demonstrate how to perform pathway analysis using -Quantitative Set Analysis of Gene Expression (QuSAGE) -([Yaari et al. _NAR_. 2013.](https://doi.org/10.1093/nar/gkt660)). +Quantitative Set Analysis of Gene Expression (QuSAGE) +([](https://doi.org/10.1093/nar/gkt660)). 
QuSAGE, implemented in the [`qusage` bioconductor package](https://bioconductor.org/packages/release/bioc/html/qusage.html), has some nice features: * It takes into account inter-gene correlation (a source of type I error). -* It returns more information than just a p-value. +* It returns more information than just a p-value. That's useful for analyses you might want to perform downstream. * Built-in visualization functionality. -We recommend taking a look at the original publication (Yaari et al.) and +We recommend taking a look at the original publication (Yaari et al.) and the R package documentation to learn more. ## Gene sets `qusage` allows you to read in gene sets that are in the [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). -[Curated gene sets from MSigDB](http://software.broadinstitute.org/gsea/msigdb/collections.jsp#C2) +[Curated gene sets from MSigDB](http://software.broadinstitute.org/gsea/msigdb/collections.jsp#C2) like [KEGG](https://www.genome.jp/kegg/) are popular for pathway analysis, but MSigDB only distributes human pathway data. Here, we'll work with a mouse dataset. -In the [`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook in -this module, we mapped human Entrez IDs to mouse symbols using the +In the [`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook in +this module, we mapped human Entrez IDs to mouse symbols using the [`hcop` package](https://github.com/stephenturner/hcop). When there was a 1:many mapping between human Entrez IDs and mouse gene symbols, -we selected the mouse gene symbol with the highest number of resources +we selected the mouse gene symbol with the highest number of resources supporting the mapping. This decision might not be suitable for every experiment and may result in some loss of information. 
## Dataset -We're using [`GSE75574`](https://www.refine.bio/experiments/GSE75574/gene-expression-in-mouse-tissues-in-response-to-short-term-calorie-restriction) in this notebook. +We're using [`GSE75574`](https://www.refine.bio/experiments/GSE75574/gene-expression-in-mouse-tissues-in-response-to-short-term-calorie-restriction) in this notebook. This dataset assays the gene expression response to short-term calorie restriction in multiple tissues from multiple mouse strains. -We'll test for pathways that change in response to short-term calorie +We'll test for pathways that change in response to short-term calorie restriction. ## Set up @@ -102,7 +102,7 @@ data_dir <- "data" ## Read in refine.bio data -The gene expression matrix of the dataset we'll be working with is too large +The gene expression matrix of the dataset we'll be working with is too large to be tracked with git without compression, so we need to unzip it if we have not already. @@ -135,10 +135,10 @@ exprs_df <- readr::read_tsv(expression_file, progress = FALSE) colnames(exprs_df)[1] <- "ENSEMBL" ``` -Because our gene sets use gene symbols and expression data from refine.bio uses +Because our gene sets use gene symbols and expression data from refine.bio uses Ensembl IDs, we need to do a conversion. -We're using the default behavior for 1:many mappings here, where only the first -one is selected +We're using the default behavior for 1:many mappings here, where only the first +one is selected ([docs](https://www.rdocumentation.org/packages/AnnotationDbi/versions/1.30.1/topics/AnnotationDb-objects)). ```{r} @@ -221,7 +221,7 @@ metadata_df <- metadata_df %>% ``` This particular accession ([`GSE75574`](https://www.refine.bio/experiments/GSE75574/gene-expression-in-mouse-tissues-in-response-to-short-term-calorie-restriction)) is a SuperSeries comprised of experiments -in multiple tissues from multiple mouse strains. +in multiple tissues from multiple mouse strains. 
We'll use white adipose tissue for all strains in this example. ```{r} @@ -247,8 +247,8 @@ metadata_df <- metadata_df %>% ### Read in KEGG pathways First, we need the sets of genes that represent pathways. -Again, these were prepared in the -[`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook (see +Again, these were prepared in the +[`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook (see [Gene Sets](#gene-sets) above). ```{r} @@ -296,12 +296,12 @@ readr::write_rds( ### Overall results -We can get a look at the general trend of the results with the `plotCIs` -function. -This plots the means and 95% confidence intervals of each pathway we tested, -sorted such that the gene sets with the highest mean will be on the left of the +We can get a look at the general trend of the results with the `plotCIs` +function. +This plots the means and 95% confidence intervals of each pathway we tested, +sorted such that the gene sets with the highest mean will be on the left of the plot. -These are gene sets that are elevated in calorie restricted mice. +These are gene sets that are elevated in calorie restricted mice. Error bars are colored by the directionality and corrected p-value (FDR by default). Unfortunately the p-value color scheme is red-green, which does not work well @@ -325,7 +325,7 @@ dev.off() ``` We can also look at the log fold-change and FDR values for pathways with the -`qsTable` function. +`qsTable` function. By default, this function shows you the top 20 pathways sorted by FDR. We can change the `number` argument to `qsTable` to decrease or increase the number of pathways returned. @@ -351,7 +351,7 @@ groups. ### KEGG ECM Receptor Interaction -The KEGG ECM Receptor Interation pathway expression is reduced in response +The KEGG ECM Receptor Interaction pathway expression is reduced in response to calorie restriction. 
We can look at the distribution of genes in this pathway with the @@ -381,7 +381,7 @@ Let's look at another example with the opposite directionality. ### KEGG Steroid Biosynthesis -The KEGG Steroid Biosynthesis pathway expression is increased in calorie +The KEGG Steroid Biosynthesis pathway expression is increased in calorie restricted adipose tissue. ```{r} diff --git a/02-microarray/pathway_analysis_microarray_06_ssgsea.Rmd b/02-microarray/pathway_analysis_microarray_06_ssgsea.Rmd index cecd4a5c..d00bfb7d 100644 --- a/02-microarray/pathway_analysis_microarray_06_ssgsea.Rmd +++ b/02-microarray/pathway_analysis_microarray_06_ssgsea.Rmd @@ -12,9 +12,9 @@ date: 2019 Pathway or gene set analysis methods like Quantitative Set Analysis of Gene Expression (QuSAGE) -([Yaari et al. _NAR_. 2013.](https://doi.org/10.1093/nar/gkt660)) or Gene Set +([](https://doi.org/10.1093/nar/gkt660)) or Gene Set Enrichment Analysis (GSEA) -([Subramanian et al. _PNAS_. 2005.](https://doi.org/10.1073/pnas.0506580102)) +([ et al. _PNAS_. 2005.](https://doi.org/10.1073/pnas.0506580102)) require us to specify group labels. We may want a better idea of what pathways are up- or down-regulated in _individual samples_ if we, for example, suspect that there are subgroups of diff --git a/03-rnaseq/00-intro-to-rnaseq.Rmd b/03-rnaseq/00-intro-to-rnaseq.Rmd index d4acc82f..0d0eb707 100644 --- a/03-rnaseq/00-intro-to-rnaseq.Rmd +++ b/03-rnaseq/00-intro-to-rnaseq.Rmd @@ -68,7 +68,7 @@ See here for more about the [quantile normalization process in refine.bio](http: ### More resources on RNA-seq technology -- [StatsQuest: A gentle introduction to RNA-seq](https://www.youtube.com/watch?v=tlf6wYJrwKY) [@Starmer2017-rnaseq]. +- [StatQuest: A gentle introduction to RNA-seq](https://www.youtube.com/watch?v=tlf6wYJrwKY) [@Starmer2017-rnaseq]. - [A general background on the wet lab methods of RNA-seq](https://bitesizebio.com/13542/what-everyone-should-know-about-rna-seq/) [@Hadfield2016]. 
- [Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5143225/) [@Love2016]. - [Mike Love blog post about sequencing biases]( https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/) [@bias-blog] @@ -96,7 +96,7 @@ Which is why our examples advise downloading [non-quantile normalized](#about-qu Our examples recommend using DESeq2 for normalizing your RNA-seq data. You may have heard about or worked with FPKM, TPM, RPKMs; how does DESeq2's normalization compare? This [handy table from an online Harvard Bioinformatics Core course nicely summarizes and compares these different methods](https://hbctraining.github.io/DGE_workshop_salmon/lessons/02_DGE_count_normalization.html#common-normalization-methods) [@dge-workshop-deseq2]. -For more about the steps behind DESeq2 normalization, we highly recommend this [StatsQuest video](https://www.youtube.com/watch?v=UFB993xufUU) which explains it quite nicely [@Starmer2017-deseq2]. +For more about the steps behind DESeq2 normalization, we highly recommend this [StatQuest video](https://www.youtube.com/watch?v=UFB993xufUU) which explains it quite nicely [@Starmer2017-deseq2]. To normalize and transform our data with DESeq2, we generally use `vst()` (variance stabilizing transformation) or `rlog()` (regularized logarithm transformation). [Both methods are very similar](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#the-variance-stabilizing-transformation-and-the-rlog). @@ -106,7 +106,7 @@ If you end up using a larger dataset and `rlog()` transformation takes a bit too ### Further resources for DESeq2 -- [StatsQuest: DESeq2, part 1, Library Normalization](https://www.youtube.com/watch?v=UFB993xufUU) [@Starmer2017-deseq2]. +- [StatQuest: DESeq2, part 1, Library Normalization](https://www.youtube.com/watch?v=UFB993xufUU) [@Starmer2017-deseq2]. 
- [DESeq2 vignette: Analyzing RNA-seq data with DESeq2](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) [@Love2014]. - [Beginner's guide to DESeq2](Bhttps://bioc.ism.ac.jp/packages/2.14/bioc/vignettes/DESeq2/inst/doc/beginner.pdf) [@Love2014-guide]. - [Introduction to DGE - Count Normalization](https://hbctraining.github.io/DGE_workshop_salmon/lessons/02_DGE_count_normalization.html) [@dge-workshop-count-normalization] diff --git a/03-rnaseq/00-intro-to-rnaseq.html b/03-rnaseq/00-intro-to-rnaseq.html index 1bce4624..0d5b0cbd 100644 --- a/03-rnaseq/00-intro-to-rnaseq.html +++ b/03-rnaseq/00-intro-to-rnaseq.html @@ -1733,7 +1733,7 @@

0.1.3 About quantile normalizatio

0.1.4 More resources on RNA-seq technology

0.2.2 DESeq2 transformation methods

-

Our examples recommend using DESeq2 for normalizing your RNA-seq data. You may have heard about or worked with FPKM, TPM, RPKMs; how does DESeq2’s normalization compare? This handy table from an online Harvard Bioinformatics Core course nicely summarizes and compares these different methods (Harvard Chan Bioinformatics Core (HBC)). For more about the steps behind DESeq2 normalization, we highly recommend this StatsQuest video which explains it quite nicely (Josh Starmer 2017b).

+

Our examples recommend using DESeq2 for normalizing your RNA-seq data. You may have heard about or worked with FPKM, TPM, RPKMs; how does DESeq2’s normalization compare? This handy table from an online Harvard Bioinformatics Core course nicely summarizes and compares these different methods (Harvard Chan Bioinformatics Core (HBC)). For more about the steps behind DESeq2 normalization, we highly recommend this StatQuest video which explains it quite nicely (Josh Starmer 2017b).

To normalize and transform our data with DESeq2, we generally use vst() (variance stabilizing transformation) or rlog() (regularized logarithm transformation). Both methods are very similar. Both normalize your data by correcting for library size differences but they also transform your data removing the dependence of the variance on the mean, meaning that low mean genes won’t have inflated variance from just one or a few samples having higher values than the rest (Michael I. Love, Simon Anders, and Wolfgang Huber 2020). Of the two methods, rlog() takes a bit longer to run (Michael I. Love and Huber 2019). If you end up using a larger dataset and rlog() transformation takes a bit too long, you can switch to using vst() with confidence since they yield similar results given the dataset is large enough (Michael I. Love and Huber 2019).
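
A minimal base-R sketch of the library size correction mentioned above, assuming the median-of-ratios approach that DESeq2 documents for its size factors. The `counts` matrix and every object name here are made up for illustration. This is not `vst()` or `rlog()`: those functions additionally remove the dependence of the variance on the mean, which this sketch does not attempt.

```r
# Toy count matrix (hypothetical): 4 genes x 3 samples, where sample2 and
# sample3 were sequenced 2x and 4x as deeply as sample1
counts <- matrix(
  c(10, 20, 40,
    100, 200, 400,
    5, 10, 20,
    50, 100, 200),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Per-gene geometric mean across samples, computed on the log scale
log_geo_means <- rowMeans(log(counts))

# Size factor per sample: the median ratio of its counts to the per-gene
# geometric means, skipping genes with a zero count (their log is -Inf)
size_factors <- apply(counts, 2, function(sample_counts) {
  usable <- is.finite(log_geo_means) & sample_counts > 0
  exp(median(log(sample_counts[usable]) - log_geo_means[usable]))
})

# Divide each sample (column) by its size factor
normalized <- sweep(counts, 2, size_factors, "/")

print(size_factors)
print(normalized)
```

In this toy case the samples differ only in sequencing depth, so the three normalized columns come out the same. With real data, the DESeq2 functions named above remain the recommended route.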

0.2.3 Further resources for DESeq2

    -
  • StatsQuest: DESeq2, part 1, Library Normalization (Josh Starmer 2017b).
  • +
  • StatQuest: DESeq2, part 1, Library Normalization (Josh Starmer 2017b).
  • DESeq2 vignette: Analyzing RNA-seq data with DESeq2 (Love et al. 2014).
  • Beginner’s guide to DESeq2 (Michael I. Love and Huber 2014).
  • Introduction to DGE - Count Normalization (Harvard Chan Bioinformatics Core (HBC))
  • diff --git a/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd b/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd index 747a8691..b924160e 100644 --- a/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd +++ b/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd @@ -109,7 +109,7 @@ Copy and paste the `SRP070849` folder into your newly created `data/` folder. Your new analysis folder should contain: -- The example analysis Rmd you downloaded +- The example analysis `.Rmd` you downloaded - A folder called "data" which contains: - The `SRP070849` folder which contains: - The gene expression @@ -312,12 +312,11 @@ We've created a heatmap but although our genes and samples are clustered, there First let's save our clustered heatmap. -### Save heatmap as a png - +### Save heatmap as a PNG You can easily switch this to save to a jpeg or tiff by changing the function and file name within the function to the respective file suffix. ```{r} -# Open a png file +# Open a PNG file png(file.path( plots_dir, "SRP070849_heatmap_non_annotated.png" # Replace file name with a relevant output plot name @@ -326,7 +325,7 @@ png(file.path( # Print your heatmap pheatmap -# Close the png file: +# Close the PNG file: dev.off() ``` @@ -334,7 +333,7 @@ Now, let's add some annotation bars to our heatmap. ## Prepare metadata for annotation -From the accompanying [paper](https://pubmed.ncbi.nlm.nih.gov/28193779/), we know that the mice with `IDH2` mutant AML were treated with vehicle or AG-221 (the first small molecule in vivo inhibitor of IDH2 to enter clinical trials) and the mice with `TET2` mutant AML were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent). 
[@Shih2017] +From the accompanying [paper](https://pubmed.ncbi.nlm.nih.gov/28193779/), we know that the mice with `IDH2` mutant AML were treated with vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials) and the mice with `TET2` mutant AML were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent). [@Shih2017] We are going to manipulate the metadata and add variables with the information for each sample, from the experimental design briefly described above, that we would like to use to annotate the heatmap. ```{r} @@ -387,8 +386,7 @@ Now that we have annotation bars on our heatmap, we have a better idea of the sa Let's save our annotated heatmap. -### Save annotated heatmap as a png - +### Save annotated heatmap as a PNG You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix. ```{r} @@ -401,7 +399,7 @@ png(file.path( # Print your heatmap pheatmap_annotated -# Close the png file: +# Close the PNG file: dev.off() ``` @@ -413,7 +411,7 @@ dev.off() # Session info At the end of every analysis, before saving your notebook, we recommend printing out your session info. -This helps make your code more reproducible by recording what versions of softwares and packages you used to run this. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. ```{r} # Print session info diff --git a/03-rnaseq/clustering_rnaseq_01_heatmap.html b/03-rnaseq/clustering_rnaseq_01_heatmap.html index ebb59bb1..f345bbe6 100644 --- a/03-rnaseq/clustering_rnaseq_01_heatmap.html +++ b/03-rnaseq/clustering_rnaseq_01_heatmap.html @@ -1666,7 +1666,7 @@

    Clustering Heatmap - RNA-seq

    CCDL for ALSF

    -

    September 2020

    +

    October 2020

@@ -1739,7 +1739,7 @@

2.5 Place the dataset in your new

2.6 Check out our file structure!

Your new analysis folder should contain:

    -
  • The example analysis Rmd you downloaded
    +
  • The example analysis .Rmd you downloaded
  • A folder called “data” which contains:
      @@ -1984,9 +1984,9 @@

      4.7 Create a heatmap

      We’ve created a heatmap but although our genes and samples are clustered, there is not much information that we can gather here because we did not provide the pheatmap() function with annotation labels for our samples.

      First let’s save our clustered heatmap.

      -

      4.7.1 Save heatmap as a png

      +

      4.7.1 Save heatmap as a PNG

      You can easily switch this to save to a jpeg or tiff by changing the function and file name within the function to the respective file suffix.

      -
      # Open a png file
      +
      # Open a PNG file
       png(file.path(
         plots_dir,
         "SRP070849_heatmap_non_annotated.png" # Replace file name with a relevant output plot name
      @@ -1995,7 +1995,7 @@ 

      4.7.1 Save heatmap as a png

       # Print your heatmap
       pheatmap
      -# Close the png file:
      +# Close the PNG file:
       dev.off()
      ## png 
       ##   2
      @@ -2004,7 +2004,7 @@

      4.7.1 Save heatmap as a png

      4.8 Prepare metadata for annotation

      -

      From the accompanying paper, we know that the mice with IDH2 mutant AML were treated with vehicle or AG-221 (the first small molecule in vivo inhibitor of IDH2 to enter clinical trials) and the mice with TET2 mutant AML were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent). (Shih et al. 2017) We are going to manipulate the metadata and add variables with the information for each sample, from the experimental design briefly described above, that we would like to use to annotate the heatmap.

      +

      From the accompanying paper, we know that the mice with IDH2 mutant AML were treated with vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials) and the mice with TET2 mutant AML were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent). (Shih et al. 2017) We are going to manipulate the metadata and add variables with the information for each sample, from the experimental design briefly described above, that we would like to use to annotate the heatmap.

      # Let's prepare the annotation data.frame for the uncollapsed `DESeqData` set object which will be used to create the technical replicates heatmap
       annotation_df <- metadata %>%
         # Create a variable to store the cancer type information
      @@ -2049,7 +2049,7 @@ 

      4.8.1 Create annotated heatmap

      Let’s save our annotated heatmap.

      -

      4.8.2 Save annotated heatmap as a png

      +

      4.8.2 Save annotated heatmap as a PNG

      You can easily switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.

      # Open a PNG file
       png(file.path(
      @@ -2060,7 +2060,7 @@ 

      4.8.2 Save annotated heatmap as a

       # Print your heatmap
       pheatmap_annotated
      -# Close the png file:
      +# Close the PNG file:
       dev.off()

      ## png 
       ##   2
      @@ -2076,7 +2076,7 @@

      5 Further learning resources abou

      6 Session info

      -

      At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of softwares and packages you used to run this.

      +

      At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

      # Print session info
       sessionInfo()
      ## R version 4.0.2 (2020-06-22)
      diff --git a/03-rnaseq/differential-expression_rnaseq_01.Rmd b/03-rnaseq/differential-expression_rnaseq_01.Rmd
      index 3fd86c66..22a4870c 100644
      --- a/03-rnaseq/differential-expression_rnaseq_01.Rmd
      +++ b/03-rnaseq/differential-expression_rnaseq_01.Rmd
      @@ -455,7 +455,7 @@ You can take your results from this example and make a heatmap following our hea
       # Session info
       
       At the end of every analysis, before saving your notebook, we recommend printing out your session info. 
      -This helps make your code more reproducible by recording what versions of softwares and packages you used to run this. 
      +This helps make your code more reproducible by recording what versions of software and packages you used to run this. 
       
       ```{r}
       # Print session info
      diff --git a/03-rnaseq/differential-expression_rnaseq_01.html b/03-rnaseq/differential-expression_rnaseq_01.html
      index cb38cc7b..af3d0a0f 100644
      --- a/03-rnaseq/differential-expression_rnaseq_01.html
      +++ b/03-rnaseq/differential-expression_rnaseq_01.html
      @@ -1666,7 +1666,7 @@
       
       

      Differential Expression - RNA-seq

      CCDL for ALSF

      -

      September 2020

      +

      October 2020

      @@ -2052,7 +2052,7 @@

      4.6 Run differential expression a

      4.6.1 Check results by plotting one gene

      To double check what a differentially expressed gene looks like, we can plot one with DESeq2::plotCounts() function.

      plotCounts(ddset, gene = "ENSG00000196074", intgroup = "asxl_mutation_status")
      -

      +

      The mutation group samples have higher expression of this gene than the control group, which helps assure us that the results are showing us what we are looking for.

@@ -2112,7 +2112,7 @@

5 Further learning resources abou

6 Session info

-

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of softwares and packages you used to run this.

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

# Print session info
 sessionInfo()
## R version 4.0.2 (2020-06-22)
@@ -2168,10 +2168,10 @@ 

6 Session info

 ## [64] pillar_1.4.6 geneplotter_1.66.0 XML_3.99-0.5
 ## [67] glue_1.4.2 evaluate_0.14 EnhancedVolcano_1.6.0
 ## [70] vctrs_0.3.4 gtable_0.3.0 getopt_1.20.3
-## [73] purrr_0.3.4 rematch2_2.1.2 assertthat_0.2.1
-## [76] emdbook_1.3.12 xfun_0.17 xtable_1.8-4
-## [79] coda_0.19-3 survival_3.1-12 tibble_3.0.3
-## [82] AnnotationDbi_1.50.3 memoise_1.1.0 ellipsis_0.3.1
+## [73] purrr_0.3.4 assertthat_0.2.1 emdbook_1.3.12
+## [76] xfun_0.17 xtable_1.8-4 coda_0.19-3
+## [79] survival_3.1-12 tibble_3.0.3 AnnotationDbi_1.50.3
+## [82] memoise_1.1.0 ellipsis_0.3.1

Blighe K., S. Rana, and M. Lewis, 2020 EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling.

diff --git a/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd b/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd index 3babf6de..740a9009 100644 --- a/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd +++ b/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd @@ -115,7 +115,7 @@ Copy and paste the `SRP133573` folder into your newly created `data/` folder. Your new analysis folder should contain: -- The example analysis Rmd you downloaded +- The example analysis `.Rmd` you downloaded - A folder called "data" which contains: - The `SRP133573` folder which contains: - The gene expression @@ -347,12 +347,12 @@ ggsave(file.path(plots_dir, "SRP133573_pca_plot.png"), # Replace with name relev - [Principle Component Analysis (PCA) Explained Visually](http://setosa.io/ev/principal-component-analysis/) [@pca-visually-explained] - [Guidelines on choosing dimension reduction methods](https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1006907&type=printable) [@Nguyen2019] -- [A nice explanation and comparison of many different dimenstionality reduction techniques that you may encounter](https://rpubs.com/Saskia/520216) [@Freytag2019] +- [A nice explanation and comparison of many different dimensionality reduction techniques that you may encounter](https://rpubs.com/Saskia/520216) [@Freytag2019] # Print session info At the end of every analysis, before saving your notebook, we recommend printing out your session info. -This helps make your code more reproducible by recording what versions of softwares and packages you used to run this. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. ```{r} # Print session info diff --git a/03-rnaseq/dimension-reduction_rnaseq_01_pca.html b/03-rnaseq/dimension-reduction_rnaseq_01_pca.html index bc2cf51c..ca0e1950 100644 --- a/03-rnaseq/dimension-reduction_rnaseq_01_pca.html +++ b/03-rnaseq/dimension-reduction_rnaseq_01_pca.html @@ -1666,7 +1666,7 @@

PCA Visualization - RNA-seq

CCDL for ALSF

-

September 2020

+

October 2020

@@ -1743,7 +1743,7 @@

2.5 Place the dataset in your new

2.6 Check out our file structure!

Your new analysis folder should contain: