AlexsLemonade · jaclyn-taroni · Sep 18, 2020 · Sep 10, 2020 · Sep 10, 2020 · Sep 11, 2020
diff --git a/02-microarray/gene-id-annotation_microarray_01_ensembl.Rmd b/02-microarray/gene-id-annotation_microarray_01_ensembl.Rmd
@@ -141,10 +141,10 @@ We suggest saving plots and results to `plots/` and `results/` directories, resp
 From here you can customize this analysis example to fit your own scientific questions and preferences. 
 
 refine.bio data comes with gene level data with Ensembl IDs. 
-Although this example script uses Ensembl IDs from Mouse, (<i>Mus musculus</i>), to obtain gene symbols this script can be easily converted for use with different species or annotation types <i>eg</i> protein IDs, gene ontology, accession numbers.
+Although this example notebook uses Ensembl IDs from Mouse, (<i>Mus musculus</i>), to obtain gene symbols this notebook can be easily converted for use with different species or annotation types <i>eg</i> protein IDs, gene ontology, accession numbers.
 
 <b>For different species</b>, wherever the abbreviation `org.Mm.eg.db` or `Mm` is written, it must be replaced with the respective species abbreviation <i>eg</i> for<i> Homo sapiens</i> `org.Hs.eg.db` or `Hs` would be used.
-In the case of our [RNA-seq gene identifier annotation example notebook](https://github.com/AlexsLemonade/refinebio-examples/blob/master/03-rnaseq/gene-id-convert_rnaseq_01_ensembl.Rmd), a Zebrafish (<i>Danio rerio</i>) dataset is used, meaning `org.Dr.eg.db` or `Dr` would also need to be used there.
+In the case of our [RNA-seq gene identifier annotation example notebook](https://github.com/AlexsLemonade/refinebio-examples/blob/master/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html), a Zebrafish (<i>Danio rerio</i>) dataset is used, meaning `org.Dr.eg.db` or `Dr` would also need to be used there.
 A full list of the annotation R packages from Bioconductor is at this [link](https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData) [@annotation-packages].
 
 ***
@@ -162,6 +162,7 @@ Let's get ready to use the Ensembl IDs from our mouse dataset to obtain the asso
 See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources.
 
 In this analysis, we will be using the `org.Mm.eg.db` R package [@Carlson2019].
+[Other species can be used](#using-a-different-refinebio-dataset-with-this-analysis).
 
 ```{r}
 # Install the mouse package
@@ -220,7 +221,6 @@ df <- df %>%
 
 ## Map Ensembl IDs to gene symbols
 
-This code chunk will return a message that indicates that there are many mappings between our keys and columns but this is okay and is what we expect in most cases since genes and transcripts do not have a 1:1 relationship.
 The `mapIds()` function has a `multiVals` argument which denotes what to do when there are multiple mapped values for a single gene identifier.
 The default behavior is to return just the first mapped value.
 It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step. 
@@ -255,7 +255,8 @@ We are going to turn our list object into a data frame object in the next chunk.
 ```{r}
 # Let's make our object a bit more manageable for exploration by turning it into a data frame using the `reshape2::melt()` function
 mapped_df <- mapped_list %>%
-  reshape2::melt()
+  reshape2::melt(value.name = "Symbol") %>%
+  dplyr::rename("Ensembl" = "L1") # Here we are renaming the column holding the Ensembl IDs as good practice
 ```
 
 Now let's take a peek at our data frame.
@@ -264,12 +265,12 @@ Now let's take a peek at our data frame.
 head(mapped_df)
 ```
 
-We can see that our data frame has a new column `value`.
-Let's get a summary of the gene symbols returned in the `value` column of our mapped data frame.
+We can see that our data frame has a new column `Symbol`.
+Let's get a summary of the gene symbols returned in the `Symbol` column of our mapped data frame.
 
 ```{r}
-# We can use the `summary()` function to get a better idea of the distribution of symbols in the `value` column
-summary(mapped_df$value)
+# We can use the `summary()` function to get a better idea of the distribution of symbols in the `Symbol` column
+summary(mapped_df$Symbol)
 ```
 
 There are 998 NA's in our data frame, which means that 998 out of the 17918 Ensembl IDs did not map to gene symbols.
@@ -282,8 +283,8 @@ Now let's check to see if we have any genes that were mapped to multiple symbols
 
 ```{r}
 multi_mapped <- mapped_df %>%
-  # Let's group by the Ensembl ID's in the `L1` column
-  dplyr::group_by(L1) %>%
+  # Let's group by the Ensembl IDs in the `Ensembl` column
+  dplyr::group_by(Ensembl) %>%
   # Create a new variable containing the number of symbols mapped to each ID
   dplyr::mutate(gene_symbol_count = dplyr::n()) %>%
   # Arrange by the genes with the highest number of symbols mapped
@@ -297,47 +298,27 @@ head(multi_mapped)
 
 Looks like we have some cases where 3 gene symbols mapped to a single Ensembl ID.
 We have a total of 130 out of 17984 Ensembl IDs with multiple mappings to gene symbols.
-We are not too worried about the 130 IDs with multiple mappings, so we will filter them out for the purpose of having 1:1 mappings for our downstream analysis.
+If we are not too worried about the 130 IDs with multiple mappings, we can filter them out for the purpose of having 1:1 mappings for our downstream analysis.
 
 ## Map Ensembl IDs to gene symbols -- filtering out multi mappings
 
-The next code chunk will rerun the `mapIds()` function, this time supplying the `"filter"` option to the `multiVals` argument.
+The next code we chunk we will rerun the `mapIds()` function, this time supplying the `"filter"` option to the `multiVals` argument.
 This will remove all instances of multiple mappings and return a list of only the gene identifiers and symbols that had 1:1 mapping.
 Use `?mapIds` to see more options or strategies.
 
 ```{r}
 # Map ensembl IDs to their associated gene symbols
-filtered_mapped_df <- mapIds(
-  org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data
-  keys = df$Gene,
-  column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to
-  keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
-  multiVals = "filter"
-) %>%
-  # Turn our list into a data frame
-  reshape2::melt() %>%
-  # Now let's give the `value` column a name that is more relevant to the values it holds
-  dplyr::rename("Symbol" = "value")
-```
-
-Let's copy the same code we used above to see if we have any multi mappings in our new data frame.
-
-```{r}
-filtered_mapped_df %>%
-  # First, we need to grab the Ensembl IDs from the rownames
-  tibble::rownames_to_column("Gene") %>%
-  # Let's group by the Ensembl ID's in the `L1` column
-  dplyr::group_by(Gene) %>%
-  # Create a new variable containing the number of symbols mapped to each ID
-  dplyr::mutate(gene_symbol_count = dplyr::n()) %>%
-  # Arrange by the genes with the highest number of symbols mapped
-  dplyr::arrange(desc(gene_symbol_count)) %>%
-  # Filter to include only the rows with multi mappings
-  dplyr::filter(gene_symbol_count > 1)
+filtered_mapped_df <- data.frame(
+  "Symbol" = mapIds(
+    org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data
+    keys = df$Gene,
+    column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to
+    keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
+    multiVals = "filter"
+  )
+)
 ```
 
-This new new `filtered_mapped_df` object is empty, as expected, since it has been filtered to include only 1:1 mappings.
-
 Now, let's write our filtered and mapped results to file!
 
 ## Write mapped results to file
@@ -354,8 +335,7 @@ readr::write_tsv(filtered_mapped_df, file.path(
 # Resources for further learning
 
 - Marc Carlson has prepared a nice [Introduction to Bioconductor Annotation Packages](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) [@Carlson2020]
-- [More on Annotation mapping functions](https://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationFuncs/inst/doc/AnnotationFuncsUserguide.pdf) [@Edwards2020]
-- See our [RNA-seq gene ID conversion notebook](https://github.com/AlexsLemonade/refinebio-examples/blob/master/03-rnaseq/gene-id-convert_rnaseq_01_ensembl.Rmd) as another applicable example, since the steps for this workflow do not change with technology [@gene-id-annotation-rna-seq].
+- See our [RNA-seq gene ID conversion notebook](https://github.com/AlexsLemonade/refinebio-examples/blob/master/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html) as another applicable example, since the steps for this workflow do not change with technology [@gene-id-annotation-rna-seq].
 
 # Session info