Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create clustering example with RNA-seq data module #127

Closed
cansavvy opened this issue Jul 9, 2020 · 26 comments
Closed

Create clustering example with RNA-seq data module #127

cansavvy opened this issue Jul 9, 2020 · 26 comments
Assignees

Comments

@cansavvy
Copy link
Contributor

cansavvy commented Jul 9, 2020

Related to #110

Currently we have a microarray example of clustering with ComplexHeatmap. We should be able to largely copy the microarray example but switch out the dataset and the accompanying wording.

When switching to an RNA-seq dataset we should also keep in mind whether the strategy of keeping genes with the largest variances is still appropriate (I think the answer should be yes). But perhaps in addition we'll want a minimum TPMs cutoff?

This can be done separately from the "getting started" additions to each analysis #116

@cansavvy cansavvy changed the title Create clustering with RNA-seq data module Create clustering example with RNA-seq data module Jul 9, 2020
@cansavvy
Copy link
Contributor Author

cansavvy commented Jul 10, 2020

Here's a DESeq2 example that might be good to borrow from: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#heatmap-of-the-count-matrix

However this would involve switching from ComplexHeatmap -> pheatmap. I'm not opposed to this because I do think ComplexHeatmap could be overwhelming for many users.

We may consider making a "basic heatmap" and "advanced heatmap" where we use pheatmap and ComplexHeatmap respectively. - This would be a future thing.

@cansavvy
Copy link
Contributor Author

I also think we should reconsider renaming this module to be more reflective of the central goal which is a heatmap not really clustering. We could rename it to clustered-heatmap so the word heatmap is actually in the title. Additionally we should tell people how to shut off the clustering in case the heatmap part is all they really wanted.

@cansavvy
Copy link
Contributor Author

Here we should go with the Non-QN'ed data -> DESeq2 normalization -> use assay() to extract the normalized data.

@cbethell
Copy link
Contributor

cbethell commented Jul 11, 2020

Based on @cansavvy's comments above, my plan is to:

  1. Copy the microarray clustering notebook and rename the module and notebook to include the word heatmap.
  2. Find a suitable RNA-Seq dataset (with non-QN'ed data), place it into the data directory, and read that into the notebook.
  3. Add a step to define a minimum counts cutoff.
  4. Perform the DESeq2 normalization (using http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#heatmap-of-the-count-matrix as inspiration to borrow from) and use assay() to further analyze the now normalized data.
  5. Keep the section that chooses genes of interest using variances.
  6. Use the pheatmap package to create a basic clustered heatmap (perhaps add an additional heatmap to show that the clustering can be turned off, although comments may be suffice here).

Does this plan appear to fall in line with your above thoughts @cansavvy?

@cansavvy
Copy link
Contributor Author

Does this plan appear to fall in line with your above thoughts @cansavvy?

This sounds good to me!

@jaclyn-taroni
Copy link
Member

If you have transformed values as you describe, you will not have TPMs. If you are concerned about low total counts, I think you could probably add a step before transformation.

@cbethell
Copy link
Contributor

If you have transformed values as you describe, you will not have TPMs. If you are concerned about low total counts, I think you could probably add a step before transformation.

Gotcha, I updated the plan in my comment above. Does that appear to be more logical?

@jaclyn-taroni
Copy link
Member

Change TPM to counts and it looks set to me

@cbethell
Copy link
Contributor

I have encountered the following error in working on getting the PR related to this issue ready:

Error in checkForExperimentalReplicates(object, modelMatrix) : 
The design matrix has the same number of samples and coefficients to fit,
so estimation of dispersion is not possible.
Treating samples as replicates was deprecated in v1.20 and no longer supported since v1.22.

This occurs at the DESeq execution step. I have downloaded various non-QN'ed refine bio datasets, tried installing the older version of the package and learned a great deal more about the DESeq2 package but so far I have been encountering the same issue.

I have one more idea that I am about to implement, but in the meantime, I wanted to find out if anyone has suggestions for this particular situation.

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Jul 15, 2020

A few comments/questions to potentially get you started:

  • Would not expect you to have to run DESeq() for this module - what step does that correspond to in what you outlined above?
  • What design argument are you using when creating the DESeqDataSet? My guess is that is where the trouble is based on the error and not seeking any additional info.

@cbethell
Copy link
Contributor

cbethell commented Jul 15, 2020

A few comments/questions to potentially get you started:

  • Would not expect you to have to run DESeq() for this module - what step does that correspond to in what you outlined above?

Ah, thank you @jaclyn-taroni, I misinterpreted that part of the vignette mentioned in the comment with my plan for this issue, I was able to create the DESeqDataSet object and thought that I needed to run the DESeq function to then carry out the normalization, but instead I was able to use assay to get the values from the DESeqDataSet object and normalize those values.

  • What design argument are you using when creating the DESeqDataSet? My guess is that is where the trouble is based on the error and not seeking any additional info.

I use therefinbio_title argument as the design.

@cansavvy
Copy link
Contributor Author

Glad you figured out about the DESeqDataSet object part.

I use the refinebio_title argument as the design.

I think we would still want to stick with the experimental_grouping variable here. Lemme check the notebook and get back to you.

@jaclyn-taroni
Copy link
Member

Generally speaking, I would expect dataset creation from matrix -> transformation -> assay()

I use the refinebio_title argument as the design.

That will, in many but probably not all cases, be a unique value. So DESeq() was trying to do differential expression analysis where there are no replicates.

I think we would still want to stick with the experimental_grouping variable here. Lemme check the notebook and get back to you.

I would expect the transformation could be blinded to the experimental design, so the design may not be important (it depends™️ on the experiment)

@cbethell
Copy link
Contributor

Generally speaking, I would expect dataset creation from matrix -> transformation -> assay()

Gotcha and noted 👍

I use the refinebio_title argument as the design.

That will, in many but probably not all cases, be a unique value. So DESeq() was trying to do differential expression analysis where there are no replicates.

Of course, that makes total sense.

I think we would still want to stick with the experimental_grouping variable here. Lemme check the notebook and get back to you.

I would expect the transformation could be blinded to the experimental design, so the design may not be important (it depends™️ on the experiment)

👍

@cansavvy
Copy link
Contributor Author

cansavvy commented Jul 15, 2020

I think we still want to stick with a SP vs MP split like we did before:

HeatmapAnnotation(df = data.frame(Groups = rep(c("SP", "MP"), # Replace with group names relevant to your data

We may want to debate about whether this manner of making the variable is the best to suggest (feels a tad precarious).

The default DESEqDataset will take the last variable (which isn't what we want) but you can also specify @jaclyn-taroni edit: -1 ~ 1 to make it no model: https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class

Regardless of whether you incorporate the SP/MP into the DESeq2Dataset object, we will still want those labels for the heatmap annotation.

@cbethell
Copy link
Contributor

I think we still want to stick with a SP vs MP split like we did before:

HeatmapAnnotation(df = data.frame(Groups = rep(c("SP", "MP"), # Replace with group names relevant to your data

We may want to debate about whether this manner of making the variable is the best to suggest (feels a tad precarious).

I'll see if I can come up with a relatively less precarious method along the way.

The default DESEqDataset will take the last variable (which isn't what we want) but you can also specify -1 to make it no model: https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class

Regardless of whether you incorporate the SP/MP into the model, we will still want those labels for the heatmap annotation.

Ah okay, gotcha

@cbethell
Copy link
Contributor

But now that we have the ball rolling, I think the bigger change we may want to make is to use a dataset with more samples since a clustering of n = 4 is not super helpful

In respect to @cansavvy's suggestion above taken from a comment on the open PR link to this issue, I have tested out a number of larger datasets and below I will list two of the datasets I tested that seemed to cluster a bit better:

Regulation of myeloid function by LC3-associated phagocytosis promotes tumor immune tolerance
Screen Shot 2020-07-20 at 10 37 32 AM

Transcriptome analysis of Group 3 and 4 medulloblastoma orthotopic xenograft mice with digoxin treatment

Screen Shot 2020-07-20 at 10 46 30 AM

Do any of these seem to be more suitable for this particular use case? If not, what would you like to see in a dataset for the notebook in this issue?

@cansavvy
Copy link
Contributor Author

cansavvy commented Jul 20, 2020

I like the general vibes of Transcriptome analysis of Group 3 and 4 medulloblastoma orthotopic xenograft mice with digoxin treatment dataset out of the two of these.

  • n = 24 seems manageable, but big enough we can do some clustering.
  • Medulloblastoma groups are topical for our audience.

But I have one qualm/question: Can we find out if some of these / which of these are replicates versus different animals completely?

This Title: Icb-2555MB-20908_S23_L003 makes me think S23 could be a sample ID? L00* is a lane number.
@cbethell, can you scan through the paper and try to find out what the actual n is here for number of animals vs replicates? It may be buried in some supplementary methods addendum (don't spend too much time on this if you can't find it). If we can't figure out this information, I don't want to prescribe statistics for it. But we can move on to the next dataset.

However, if we can figure it out the experimental model, and it still seems like a decent enough for clustering, it could be a good example of making sure to take into account replicates in a responsible manner.

@cbethell
Copy link
Contributor

I like the general vibes of Transcriptome analysis of Group 3 and 4 medulloblastoma orthotopic xenograft mice with digoxin treatment dataset out of the two of these.

  • n = 24 seems manageable, but big enough we can do some clustering.
  • Medulloblastoma groups are topical for our audience.

But I have one qualm/question: Can we find out if some of these / which of these are replicates versus different animals completely?

This Title: Icb-2555MB-20908_S23_L003 makes me think S23 could be a sample ID? L00* is a lane number.
@cbethell, can you scan through the paper and try to find out what the actual n is here for number of animals vs replicates? It may be buried in some supplementary methods addendum (don't spend too much time on this if you can't find it). If we can't figure out this information, I don't want to prescribe statistics for it. But we can move on to the next dataset.

However, if we can figure it out the experimental model, and it still seems like a decent enough for clustering, it could be a good example of making sure to take into account replicates in a responsible manner.

@cansavvy I did not explicitly find information on which of the samples are replicates in the paper, but they did note that they "chose 10 mice per group (digoxin-treated and controls)".

@cansavvy
Copy link
Contributor Author

@cansavvy I did not explicitly find information on which of the samples are replicates in the paper, but they did note that they "chose 10 mice per group (digoxin-treated and controls)".

They didn't make it easy. Their supplement never includes a table that explains their samples/mice. But I've decoded it based on Table S9 and doing some string manipulations. There are 6 mice in the final experiment and 4 replicates of each.

Turns out the first part is the treament: e.g. Icb-2555MB and the second part is the mouse id: eg. 20908_S23_L003 Though the lanes are inconsistently included for reasons I don't understand.

I did this sloppy thing to check if this was right.

summary(as.factor(stringr::word(metadata$refinebio_title, sep = "-|_", 3)))

If we are going to continue with this dataset, I would like to know how the replicates cluster (or don't cluster) together.

The question is whether we think it's useful to continue digging into this dataset or if we should look for something else. Let's chat about it briefly after/during the science team meeting.

@cansavvy
Copy link
Contributor Author

This is a reminder that once we nail down what dataset we will use here, we will need to discuss what the appropriate variable is to supply to the design argument of DESeq2DatasetFromMatrix function. This will be dependent on experimental design.

@cansavvy
Copy link
Contributor Author

Per our discussion in science team meeting, I think we should move forward with the xenograft mouse medulloblastoma dataset you've found here, @cbethell , so as you move along, we'll just need to discuss what's the best way to model dealing with replicates in clustering. This probably starts with determining what combination of variables need to be included in the design argument of the DESeq2Dataset.

Although this may not affect the normalization step per se (may want to look in the vst() or rlog() documentation to confirm this), we want to be modelling best practices to our users of these examples at all times. i.e. If a person returned to the DESeq2Datasetwe created in this module to perform differential expression, it would be in their best interest to set up the design correctly from the get go, so they wouldn't accidentally perform DE on an improper model design.

For this xenograft mouse medulloblastoma dataset, we will need to involve the replicates in the design. Taking another look at the metadata that is included on GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115542
It looks like these are technical replicates as opposed to biological replicates. Technical replicates meaning it is the same animal, sample, and library but split across multiple lanes for sequencing. I am 95% sure this is what is happening with these samples. You may want to take a look at the supplementary methods and GEO entries for yourself to see if you agree.

In this case, we should look into using collapseReplicates. See manual about this: http://bioconductor.riken.jp/packages/3.5/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#:~:text=(dds%24condition)-,Collapsing%20technical%20replicates,runs%20of%20the%20same%20library.
Here's the collapseReplicates docs page: https://rdrr.io/bioc/DESeq2/man/collapseReplicates.html

We need to figure out what this might look like in the case of this clustering example. I think our goal would be to:

  1. Cluster samples with replicates labeled - ask do the technical replicates cluster together?
  2. Cluster a second time and ask the biology relevant question: do the medulloblastoma model groups cluster?

This being said, I don't know enough about how collapseReplicates() works (maybe @jaclyn-taroni does) so we need to determine this first:

  • Do the collapsed replicates get taken into account for normalization?
  • If no, this makes our jobs easier, because we can normalize once and cluster from the same DESEq2Dataset object twice and just add another layer or labels.
  • If yes, then we will need to make two DESEq2Datasets that normalized with collapseReplicates() and without and then see how the clustering changes.

If it's only at the differential expression step that collapseReplicates() "matter", then we should revisit this on #137

@cbethell
Copy link
Contributor

Per our discussion in science team meeting, I think we should move forward with the xenograft mouse medulloblastoma dataset you've found here, @cbethell , so as you move along, we'll just need to discuss what's the best way to model dealing with replicates in clustering. This probably starts with determining what combination of variables need to be included in the design argument of the DESeq2Dataset.

Although this may not affect the normalization step per se (may want to look in the vst() or rlog() documentation to confirm this), we want to be modelling best practices to our users of these examples at all times. i.e. If a person returned to the DESeq2Datasetwe created in this module to perform differential expression, it would be in their best interest to set up the design correctly from the get go, so they wouldn't accidentally perform DE on an improper model design.

For this xenograft mouse medulloblastoma dataset, we will need to involve the replicates in the design. Taking another look at the metadata that is included on GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115542
It looks like these are technical replicates as opposed to biological replicates. Technical replicates meaning it is the same animal, sample, and library but split across multiple lanes for sequencing. I am 95% sure this is what is happening with these samples. You may want to take a look at the supplementary methods and GEO entries for yourself to see if you agree.

I took a look and agree with you on what's happening here.

In this case, we should look into using collapseReplicates. See manual about this: http://bioconductor.riken.jp/packages/3.5/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#:~:text=(dds%24condition)-,Collapsing%20technical%20replicates,runs%20of%20the%20same%20library.
Here's the collapseReplicates docs page: https://rdrr.io/bioc/DESeq2/man/collapseReplicates.html

We need to figure out what this might look like in the case of this clustering example. I think our goal would be to:

  1. Cluster samples with replicates labeled - ask do the technical replicates cluster together?
  2. Cluster a second time and ask the biology relevant question: do the medulloblastoma model groups cluster?

This being said, I don't know enough about how collapseReplicates() works (maybe @jaclyn-taroni does) so we need to determine this first:

  • Do the collapsed replicates get taken into account for normalization?
  • If no, this makes our jobs easier, because we can normalize once and cluster from the same DESEq2Dataset object twice and just add another layer or labels.
  • If yes, then we will need to make two DESEq2Datasets that normalized with collapseReplicates() and without and then see how the clustering changes.

My findings suggest that the answer here is yes, so in my last commit over on open PR #140, I implemented the collapseReplicates() function and normalized the collapsed and uncollapsed datasets, then created separate heatmaps for the collapsed and uncollapsed datasets. The design argument of the DESeq2Dataset is still questionable but you can take a look to see if I successfully implemented what you describe here @cansavvy.

@cansavvy
Copy link
Contributor Author

@cbethell , feel free to finish making the rest of the changes for PR #140 , and then we can go back and look at the details for the design argument. Ping me when you are ready for a review.

@cbethell
Copy link
Contributor

cbethell commented Aug 5, 2020

As discussed with @jaclyn-taroni, the connected PR #140 has been closed due to the overwhelming amount of comments that have accumulated over the course of the PR.

The plan now is to open a new draft PR with one of the datasets that were posted in the last comment on PR #140. Per @jaclyn-taroni's comment on PR #140, I will choose the dataset I believe is simpler/less complex of the two based on the amount of metadata smoothing required. The simpler dataset also means that the collapse replicates steps will be removed and shorten the overall length of the clustering notebook.

Said draft PR will include a more simplified code for the clustering example notebook, that would start with the heatmap, then deal with the metadata and annotation bars one by one. These changes will be solely code related upon the opening of the draft PR, writing/context will be refined after.

@cbethell
Copy link
Contributor

This ticket has been addressed in merged PR #151 and can be closed. 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants