Create clustering example with RNA-seq data module #127
Comments
Here's a DESeq2 example that might be good to borrow from: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#heatmap-of-the-count-matrix However this would involve switching from We may consider making a "basic heatmap" and "advanced heatmap" where we use |
I also think we should consider renaming this module to better reflect its central goal, which is a heatmap rather than clustering. We could rename it to |
Here we should go with the Non-QN'ed data -> DESeq2 normalization -> use |
Based on @cansavvy's comments above, my plan is to:
Does this plan appear to fall in line with your above thoughts @cansavvy? |
This sounds good to me! |
If you have transformed values as you describe, you will not have TPMs. If you are concerned about low total counts, I think you could probably add a step before transformation. |
Gotcha, I updated the plan in my comment above. Does that appear to be more logical? |
Change TPM to counts and it looks set to me |
I have encountered the following error in working on getting the PR related to this issue ready:
This occurs at the I have one more idea that I am about to implement, but in the meantime, I wanted to find out if anyone has suggestions for this particular situation. |
A few comments/questions to potentially get you started:
|
Ah, thank you @jaclyn-taroni. I misinterpreted the part of the vignette mentioned in the comment with my plan for this issue. I was able to create the DESeqDataSet object and thought that I needed to run the
I use the |
Glad you figured out about the DESeqDataSet object part.
I think we would still want to stick with the experimental_grouping variable here. Lemme check the notebook and get back to you. |
Generally speaking, I would expect dataset creation from matrix -> transformation ->
That will, in many but probably not all cases, be a unique value. So
I would expect the transformation could be blinded to the experimental design, so the design may not be important (it depends™️ on the experiment) |
Gotcha and noted 👍
Of course, that makes total sense.
👍 |
I think we still want to stick with a
We may want to debate whether this way of making the variable is the best one to suggest (it feels a tad precarious). The default DESeqDataSet will take the last variable (which isn't what we want) but you can also specify @jaclyn-taroni edit: Regardless of whether you incorporate the SP/MP into the DESeqDataSet object, we will still want those labels for the heatmap annotation. |
I'll see if I can come up with a relatively less precarious method along the way.
Ah okay, gotcha |
With respect to @cansavvy's suggestion above, taken from a comment on the open PR linked to this issue, I have tested a number of larger datasets. Below I list two of the datasets I tested that seemed to cluster a bit better: Regulation of myeloid function by LC3-associated phagocytosis promotes tumor immune tolerance Do any of these seem more suitable for this particular use case? If not, what would you like to see in a dataset for the notebook in this issue? |
I like the general vibes of
But I have one qualm/question: Can we find out which of these are replicates and which are completely different animals? This However, if we can figure out the experimental model, and it still seems decent enough for clustering, it could be a good example of taking replicates into account in a responsible manner. |
@cansavvy I did not explicitly find information on which of the samples are replicates in the paper, but they did note that they "chose 10 mice per group (digoxin-treated and controls)". |
They didn't make it easy. Their supplement never includes a table that explains their samples/mice, but I've decoded it based on Table S9 and some string manipulation. There are 6 mice in the final experiment, with 4 replicates of each. It turns out the first part is the treatment: e.g. I did this sloppy thing to check if this was right.
If we are going to continue with this dataset, I would like to know how the replicates cluster (or don't cluster) together. The question is whether we think it's useful to continue digging into this dataset or if we should look for something else. Let's chat about it briefly after/during the science team meeting. |
This is a reminder that once we nail down what dataset we will use here, we will need to discuss what the appropriate variable is to supply to the |
Per our discussion in science team meeting, I think we should move forward with the xenograft mouse medulloblastoma dataset you've found here, @cbethell , so as you move along, we'll just need to discuss what's the best way to model dealing with replicates in clustering. This probably starts with determining what combination of variables need to be included in the Although this may not affect the normalization step per se (may want to look in the For this xenograft mouse medulloblastoma dataset, we will need to involve the replicates in the design. Taking another look at the metadata that is included on GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115542 In this case, we should look into using We need to figure out what this might look like in the case of this clustering example. I think our goal would be to:
This being said, I don't know enough about how
If it's only at the differential expression step that |
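For reference, a minimal sketch of what collapsing replicates could look like in this context: raw counts from replicate samples of the same mouse are summed into one column per mouse before any transformation. This is written in plain Python purely as a language-neutral illustration (the notebook itself is in R), and the function name, sample names, and mouse labels below are all hypothetical, not taken from the dataset.

```python
from collections import defaultdict


def collapse_replicates(counts, sample_to_group):
    """Sum raw counts across samples that share a group label.

    counts: dict mapping gene -> {sample name: raw count}
    sample_to_group: dict mapping sample name -> group label (e.g. a mouse ID)
    """
    groups = sorted(set(sample_to_group.values()))  # fixed column order
    collapsed = {}
    for gene, per_sample in counts.items():
        totals = defaultdict(int)
        for sample, value in per_sample.items():
            totals[sample_to_group[sample]] += value
        collapsed[gene] = {group: totals[group] for group in groups}
    return collapsed


# Hypothetical example: two mice with two replicates each.
example_counts = {
    "geneA": {"m1_r1": 5, "m1_r2": 7, "m2_r1": 3, "m2_r2": 1},
    "geneB": {"m1_r1": 0, "m1_r2": 2, "m2_r1": 10, "m2_r2": 12},
}
example_groups = {
    "m1_r1": "mouse1", "m1_r2": "mouse1",
    "m2_r1": "mouse2", "m2_r2": "mouse2",
}
collapsed_counts = collapse_replicates(example_counts, example_groups)
```

Summing only makes sense on raw counts, so a step like this would sit before the transformation rather than after it.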
I took a look and agree with you on what's happening here.
My findings suggest that the answer here is yes, so in my last commit over on open PR #140, I implemented the |
As discussed with @jaclyn-taroni, the connected PR #140 has been closed because of the overwhelming number of comments that accumulated over the course of the PR. The plan now is to open a new draft PR with one of the datasets posted in the last comment on PR #140. Per @jaclyn-taroni's comment on PR #140, I will choose the dataset I believe is the simpler of the two, based on the amount of metadata smoothing required. Using the simpler dataset also means that the collapse replicates steps will be removed, which shortens the clustering notebook. The draft PR will include simplified code for the clustering example notebook that starts with the heatmap, then deals with the metadata and annotation bars one by one. The changes will be code only when the draft PR is opened; writing/context will be refined afterwards. |
This ticket has been addressed in merged PR #151 and can be closed. 🎉 |
Related to #110
Currently we have a microarray example of clustering with ComplexHeatmap. We should be able to largely copy the microarray example but switch out the dataset and the accompanying wording.
When switching to an RNA-seq dataset we should also keep in mind whether the strategy of keeping the genes with the largest variances is still appropriate (I think the answer should be yes). But perhaps in addition we'll want a minimum TPM cutoff?
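To make the two filters above concrete, here is a minimal sketch, in plain Python rather than the notebook's R and purely for illustration. It first drops genes whose total expression falls below a minimum cutoff, then keeps the genes with the largest variance on a transformed scale. The function name, the cutoff of 10, and the `log2(x + 1)` transform are assumptions chosen for the example; the notebook would use its own normalization or transformation instead.

```python
from math import log2
from statistics import variance


def select_variable_genes(expression, min_total=10, top_n=2):
    """Return the names of the top_n most variable genes.

    expression: dict mapping gene -> list of per-sample values
    min_total: genes whose values sum to less than this are dropped first
    """
    # Step 1: minimum-expression cutoff on the untransformed values.
    kept = {g: v for g, v in expression.items() if sum(v) >= min_total}
    # Step 2: variance on a log2(x + 1) scale (a stand-in transformation).
    gene_variance = {
        g: variance([log2(x + 1) for x in v]) for g, v in kept.items()
    }
    ranked = sorted(gene_variance, key=gene_variance.get, reverse=True)
    return ranked[:top_n]


# Hypothetical values for four genes across four samples.
example_expression = {
    "geneA": [0, 1, 0, 2],          # very low total, removed by the cutoff
    "geneB": [100, 102, 98, 101],   # expressed but nearly flat
    "geneC": [10, 200, 15, 180],    # expressed and highly variable
    "geneD": [50, 5, 60, 8],        # expressed and moderately variable
}
top_genes = select_variable_genes(example_expression, min_total=10, top_n=2)
```

Applying the cutoff before ranking keeps very lowly expressed genes, whose variance is mostly noise, from being selected for the heatmap.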
This can be done separately from the "getting started" additions to each analysis #116