Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add microarray gene ID conversion example #212

Merged
merged 19 commits into from
Sep 18, 2020
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 23 additions & 43 deletions 02-microarray/gene-id-annotation_microarray_01_ensembl.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -141,10 +141,10 @@ We suggest saving plots and results to `plots/` and `results/` directories, resp
From here you can customize this analysis example to fit your own scientific questions and preferences.

refine.bio data comes with gene level data with Ensembl IDs.
Although this example script uses Ensembl IDs from Mouse, (<i>Mus musculus</i>), to obtain gene symbols this script can be easily converted for use with different species or annotation types <i>eg</i> protein IDs, gene ontology, accession numbers.
Although this example notebook uses Ensembl IDs from Mouse, (<i>Mus musculus</i>), to obtain gene symbols this notebook can be easily converted for use with different species or annotation types <i>eg</i> protein IDs, gene ontology, accession numbers.

<b>For different species</b>, wherever the abbreviation `org.Mm.eg.db` or `Mm` is written, it must be replaced with the respective species abbreviation <i>eg</i> for<i> Homo sapiens</i> `org.Hs.eg.db` or `Hs` would be used.
In the case of our [RNA-seq gene identifier annotation example notebook](https://github.com/AlexsLemonade/refinebio-examples/blob/master/03-rnaseq/gene-id-convert_rnaseq_01_ensembl.Rmd), a Zebrafish (<i>Danio rerio</i>) dataset is used, meaning `org.Dr.eg.db` or `Dr` would also need to be used there.
In the case of our [RNA-seq gene identifier annotation example notebook](https://github.com/AlexsLemonade/refinebio-examples/blob/master/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html), a Zebrafish (<i>Danio rerio</i>) dataset is used, meaning `org.Dr.eg.db` or `Dr` would also need to be used there.
A full list of the annotation R packages from Bioconductor is at this [link](https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData) [@annotation-packages].

***
Expand All @@ -162,6 +162,7 @@ Let's get ready to use the Ensembl IDs from our mouse dataset to obtain the asso
See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using the `org.Mm.eg.db` R package [@Carlson2019].
cbethell marked this conversation as resolved.
Show resolved Hide resolved
[Other species can be used](#using-a-different-refinebio-dataset-with-this-analysis).

```{r}
# Install the mouse package
Expand Down Expand Up @@ -220,7 +221,6 @@ df <- df %>%

## Map Ensembl IDs to gene symbols

This code chunk will return a message that indicates that there are many mappings between our keys and columns but this is okay and is what we expect in most cases since genes and transcripts do not have a 1:1 relationship.
The `mapIds()` function has a `multiVals` argument which denotes what to do when there are multiple mapped values for a single gene identifier.
cbethell marked this conversation as resolved.
Show resolved Hide resolved
The default behavior is to return just the first mapped value.
It is good to keep in mind that various downstream analyses may benefit from varied strategies at this step.
Expand Down Expand Up @@ -255,7 +255,8 @@ We are going to turn our list object into a data frame object in the next chunk.
```{r}
# Let's make our object a bit more manageable for exploration by turning it into a data frame using the `reshape2::melt()` function
mapped_df <- mapped_list %>%
reshape2::melt()
reshape2::melt(value.name = "Symbol") %>%
dplyr::rename("Ensembl" = "L1") # Here we are renaming the column holding the Ensembl IDs as good practice
```

Now let's take a peek at our data frame.
Expand All @@ -264,12 +265,12 @@ Now let's take a peek at our data frame.
head(mapped_df)
```

We can see that our data frame has a new column `value`.
Let's get a summary of the gene symbols returned in the `value` column of our mapped data frame.
We can see that our data frame has a new column `Symbol`.
Let's get a summary of the gene symbols returned in the `Symbol` column of our mapped data frame.

```{r}
# We can use the `summary()` function to get a better idea of the distribution of symbols in the `value` column
summary(mapped_df$value)
# We can use the `summary()` function to get a better idea of the distribution of symbols in the `Symbol` column
summary(mapped_df$Symbol)
```

There are 998 NA's in our data frame, which means that 998 out of the 17918 Ensembl IDs did not map to gene symbols.
Expand All @@ -282,8 +283,8 @@ Now let's check to see if we have any genes that were mapped to multiple symbols

```{r}
multi_mapped <- mapped_df %>%
# Let's group by the Ensembl ID's in the `L1` column
dplyr::group_by(L1) %>%
# Let's group by the Ensembl IDs in the `Ensembl` column
dplyr::group_by(Ensembl) %>%
# Create a new variable containing the number of symbols mapped to each ID
dplyr::mutate(gene_symbol_count = dplyr::n()) %>%
# Arrange by the genes with the highest number of symbols mapped
Expand All @@ -297,47 +298,27 @@ head(multi_mapped)

Looks like we have some cases where 3 gene symbols mapped to a single Ensembl ID.
We have a total of 130 out of 17984 Ensembl IDs with multiple mappings to gene symbols.
We are not too worried about the 130 IDs with multiple mappings, so we will filter them out for the purpose of having 1:1 mappings for our downstream analysis.
If we are not too worried about the 130 IDs with multiple mappings, we can filter them out for the purpose of having 1:1 mappings for our downstream analysis.

## Map Ensembl IDs to gene symbols -- filtering out multi mappings

The next code chunk will rerun the `mapIds()` function, this time supplying the `"filter"` option to the `multiVals` argument.
The next code we chunk we will rerun the `mapIds()` function, this time supplying the `"filter"` option to the `multiVals` argument.
cbethell marked this conversation as resolved.
Show resolved Hide resolved
This will remove all instances of multiple mappings and return a list of only the gene identifiers and symbols that had 1:1 mapping.
Use `?mapIds` to see more options or strategies.

```{r}
# Map ensembl IDs to their associated gene symbols
filtered_mapped_df <- mapIds(
org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data
keys = df$Gene,
column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to
keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
multiVals = "filter"
) %>%
# Turn our list into a data frame
reshape2::melt() %>%
# Now let's give the `value` column a name that is more relevant to the values it holds
dplyr::rename("Symbol" = "value")
```

Let's copy the same code we used above to see if we have any multi mappings in our new data frame.

```{r}
filtered_mapped_df %>%
# First, we need to grab the Ensembl IDs from the rownames
tibble::rownames_to_column("Gene") %>%
# Let's group by the Ensembl ID's in the `L1` column
dplyr::group_by(Gene) %>%
# Create a new variable containing the number of symbols mapped to each ID
dplyr::mutate(gene_symbol_count = dplyr::n()) %>%
# Arrange by the genes with the highest number of symbols mapped
dplyr::arrange(desc(gene_symbol_count)) %>%
# Filter to include only the rows with multi mappings
dplyr::filter(gene_symbol_count > 1)
filtered_mapped_df <- data.frame(
cbethell marked this conversation as resolved.
Show resolved Hide resolved
cbethell marked this conversation as resolved.
Show resolved Hide resolved
"Symbol" = mapIds(
org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data
keys = df$Gene,
column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to
keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
multiVals = "filter"
cbethell marked this conversation as resolved.
Show resolved Hide resolved
)
)
```

This new new `filtered_mapped_df` object is empty, as expected, since it has been filtered to include only 1:1 mappings.

Now, let's write our filtered and mapped results to file!

## Write mapped results to file
Expand All @@ -354,8 +335,7 @@ readr::write_tsv(filtered_mapped_df, file.path(
# Resources for further learning

- Marc Carlson has prepared a nice [Introduction to Bioconductor Annotation Packages](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) [@Carlson2020]
- [More on Annotation mapping functions](https://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationFuncs/inst/doc/AnnotationFuncsUserguide.pdf) [@Edwards2020]
- See our [RNA-seq gene ID conversion notebook](https://github.com/AlexsLemonade/refinebio-examples/blob/master/03-rnaseq/gene-id-convert_rnaseq_01_ensembl.Rmd) as another applicable example, since the steps for this workflow do not change with technology [@gene-id-annotation-rna-seq].
- See our [RNA-seq gene ID conversion notebook](https://github.com/AlexsLemonade/refinebio-examples/blob/master/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html) as another applicable example, since the steps for this workflow do not change with technology [@gene-id-annotation-rna-seq].

# Session info

Expand Down
Loading