Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: WGCNA module draft #350

Closed
wants to merge 11 commits into from
Closed

WIP: WGCNA module draft #350

wants to merge 11 commits into from

Conversation

cansavvy
Copy link
Contributor

@cansavvy cansavvy commented Nov 11, 2020

Analysis Purpose

It's a spin off issue from #306 and the discussion on #346.

Ultimately I was trying to see if we could squeak in a quick WGCNA run for ORA in one module, but as you can see here, it's not really realistic for the. WGCNA run to be that short particularly since we also need to do some DESeq2 transformations before we can get to WGCNA.

Pull Request Stage

This is a Draft PR - needs review of big concepts and outline

Strategy

I've been trying to sort through what seemed like the important points of the WGCNA tutorials and:

  • Make them tidyverse and ggplot2
  • Put more explanation about what things mean for the user

Also made sure we are following the RNA-seq guidance that's written in the FAQ of WGCNA.

Concerns/Questions for reviewers:

  • In the "Explore WGCNA" results section you will see I'm doing a number of things, making multiple plots. I in no way expect we keep all those things, but instead I want to know which things we find most useful and we can delete the others and I'll expand a tad more on the useful things. Here's a summary of the things I'm proposing (some of which are inspired by plots I see in WGCNA tutorials):

    • Show users how to save the which gene is which module key (this seems necessary/useful)
    • A metadata labeled, sample x eigengene heatmap
    • A module eigengene x module eigengene heatmap
  • You'll see I briefly mention/explain TOMtype parameter. Do we think this is unnecessary/out of scope for our users?

  • I've been looking at the tutorials and papers and forums and I can't for the life of me find an intuitive explanation for what "power" for soft-threshold network construction actually is (not info on how to use it, but like what does it mean?? What does WGCNA do with it). So if anyone has insight on that, I couldn't figure it out.

  • I have set the seed in the main WGCNA function blockwiseModules() but it sounds like a global set the seed might work instead? I suppose global is preferred or no?

Analysis Pull Request Check List (roughly in order):

Content checks

  • All {{BLANKS}} have been replaced with the correct content.
  • Sources are cited
  • Seed is set (if applicable)

Formatting Checks

  • Removed any manual numbering of sections.
  • Removed any instances of chunk naming.
  • Comments and documentation are up to date.
  • All links have been checked and are properly formatted.

Add datasets to S3

Docker/Snakemake rendering components

  • Added the .html link to the navigation bar.
  • Any not yet added packages needed for this analysis have been added to the Dockerfile and it successfully builds.

@cansavvy
Copy link
Contributor Author

network-analysis_rnaseq_01_wgcna.html.zip

Here's the rendered html if you wanna see what the draft plots look like without checking out the branch.

@cansavvy cansavvy requested a review from jashapiro November 11, 2020 18:33
@jashapiro
Copy link
Member

I'll start with the elephant, which is that I share your concerns about this being a bit of a bear (mixed animal metaphors for the win). I'm not 100% sure how I feel about that on the whole. It does seem like something that might end up in "advanced" rather than the main set of analyses.

I am also not sure whether you are intending to keep this described as ORA or if that is just leftovers from the template. I see it as quite distinct from a standard ORA, and I would not refer to it as such. Maybe you are planning to use these modules as the sets for ORA, but that would seem rather circular to me.

Skipping to the end, I feel like we would want some kind of rough conclusion for users to get out of this analysis. Is there a way we can reduce the number of modules that are returned to get the ones we might care most about? What do we do with these modules once we find them? I think the relationship of the eigengenes to the metadata labels gets closest to answering this question, so it seems like the best place to focus effort.

To make this easier to interpret, we might try to come up with ways to reduce the number of Eigengenes that we get out. That might involve adjusting and examining some parameters and thresholds, which I imagine is work that might not make it into an example notebook, but should be done. The thing we need to think about is how to provide a balance between a single example that looks pretty good and the fact that parameters and results should be examined carefully.

Answering your question about power: I think based on my reading that the Power is the exponent to which they raise the correlations to before thresholding & dividing the correlation network into modules. This separates the really high correlations from the just kind of high correlations. At least that is my read. (0.99^14 vs 0.98^14 is a much larger difference than 0.99 vs. 0.98, and I imagine that helps with the thresholding steps). So yeah, the first definition of power you learned, not statistical power, which is what I would have assumed based on no context (except that the values make no sense as statistical power).

A couple of specific notes:

  • power = 14 may be too low for this example, as it makes the threshold just below 0.8, not above it. Not that this seems like it should be a hard rule. This and other parameters probably require tweaking, which is too much to cover here, but we should present an example that looks "good". If you raise the power, do you get more or fewer eigengenes?

  • It would probably make sense to use numeric module labels rather than the color names, especially as you are not plotting those colors at all. (Also, I think "grey" means no module as the equivalent to module 0, which is worth noting).

  • Handy tidyverse function: tibble::enframe()
    You can replace this:

gene_modules_df <- data.frame(gene_modules) %>%
  tibble::rownames_to_column("Gene")

with:

gene_modules_df <- tibble::enframe(gene_modules, name = "gene", value = "module")

To me that is more clear what is going on, and does it all in one neat step.
Though really, for this use case, just table(gene_modules) would be sufficient. I don't think we need to tidy all the things.

@cansavvy
Copy link
Contributor Author

@jashapiro, This was my attempt to "simplify" the WGCNA tutorials, but I think you are correct that it is still too much for a non-advanced topics user to go through. I tried to give just enough parameter guidance as to not be totally cavalier with the analysis but I think if someone were to really get into this the right way, they should dive into the parameters more than the user of this section may be willing or able to do. So... I think moving this to advanced-topics makes sense.

To clear up the confusion about is this ORA or not:
Although this is not supposed to be ORA anymore, the idea was wee would take one of the gene modules from this analysis to do ORA -- in its own separate example notebook #344. (The ORA comments in the notebook leftover are remnants from the older version of this draft which was ORA but I wanted to get big picture thoughts before doing anymore polishing)

So, this means as far as timeline, we need to discuss the further work on this since the priority for this sprint is getting the pathway analyses and this was a tangent that was supposed to feed into an ORA, pathway analysis example. That being said it doesn't make sense to have an advanced topics example feed into a "main" example, so we will have to reevaluate where we are getting a gene set from for the ORA example #344.

In other notes:

  • Thanks for your interpretation on power. I had a suspicion it related to exponents but couldn't figure out for sure what it meant.
  • I like your other suggestions and will implement them after we figure out "big picture" where this example is going.

@jashapiro
Copy link
Member

To clear up the confusion about is this ORA or not:
Although this is not supposed to be ORA anymore, the idea was wee would take one of the gene modules from this analysis to do ORA -- in its own separate example notebook #344. (The ORA comments in the notebook leftover are remnants from the older version of this draft which was ORA but I wanted to get big picture thoughts before doing anymore polishing)

Is this a thing that people do? Because as I said, it seems kind of circular. I can see taking modules from one data set and using them for ORA in another data set, but if you are using modules defined in the same data set, I would assume that all the genes will be significant or none of them will be, as the modules are defined by the correlations between genes...

@jaclyn-taroni
Copy link
Member

The question I was after is "Does this coexpression module significantly overlap with known pathways?" So given a module that is associated with metadata, for example, are p53 pathway genes overrepresented?

That being said it doesn't make sense to have an advanced topics example feed into a "main" example, so we will have to reevaluate where we are getting a gene set from for the ORA example

I'm not outright opposed to having an advanced example feed into a "main" example. We're not sure people will ever look at more than one example and our point for getting on this tangent was basically that any list of genes of interest will work.

@cansavvy
Copy link
Contributor Author

I'm not outright opposed to having an advanced example feed into a "main" example. We're not sure people will ever look at more than one example and our point for getting on this tangent was basically that any list of genes of interest will work.

I suppose that's fair. We usually link to the notebook where the data come from though, in case the user wants to produce the data themselves, so I guess we'd have to ask if it's okay to link a user to a more advanced topic or will that be overwhelming. It's obviously unclear to us at this stage how users will be navigating these kinds of examples with predecessors so perhaps it's not a problem.

@jashapiro
Copy link
Member

The question I was after is "Does this coexpression module significantly overlap with known pathways?" So given a module that is associated with metadata, for example, are p53 pathway genes overrepresented?

Ah, that makes more sense...

I am also not opposed to using an advanced example to feed back in, but it does seems a bit of a slippery slope. It could add confusion as to the "why" for a particular analysis.

@cansavvy
Copy link
Contributor Author

I am also not opposed to using an advanced example to feed back in, but it does seems a bit of a slippery slope. It could add confusion as to the "why" for a particular analysis.

I think the plan is to move forward as is, but when we add a more simple method of getting gene cluster example, we can replace it in ORA. See here: #344 (comment)

@cansavvy
Copy link
Contributor Author

So it seems we probably want to end up with a gene list from a module that is most related to a metadata variable of interest. This seems most pertinent to what users would want to do as well as our downstream ORA application. This is somewhat related to my rough draft of a heatmap of eigengenes with metadata labels, but should probably be a little more methodical than that.

One of the WGCNA tutorials suggests doing this kind of analysis through more correlations and a plot like this:
Screen Shot 2020-11-11 at 5 43 51 PM

If we want to do this type of analysis, this means we'd actually have to have a meaningful numeric variable or more which means this dataset won't work and I'll have to do some digging to find a dataset that would work. (So meanwhile I'm not going to worry about playing with power if we switch datasets completely).

However, I'm not crazy about more correlations, so if people have other suggestions of how to identify modules of most interest, I'm open ears. I'll do some digging and see if there's other legit ideas out there.

@jaclyn-taroni
Copy link
Member

You can test the module eigengenes for differential expression if you have a categorical variable that's of interest.

@cansavvy
Copy link
Contributor Author

You can test the module eigengenes for differential expression if you have a categorical variable that's of interest.

I was thinking about that but didn't know if that'd be considered a "valid" method.

@jashapiro
Copy link
Member

I was thinking about that but didn't know if that'd be considered a "valid" method.

It absolutely would be! Think about the eigengenes as a smaller set of genes that you can use to reduce multiple testing burdens. SO you can treat them as "gene" with measured expression, and then ask if the expression is different between the groups you care about.

In the example you showed above, they are comparing lots of different continuous traits to eigengene "expression", and plotting all the correlation coefficients, but the much simpler case is when you have only one trait and you want to see which of the eigengenes is correlated with it. In the very simplest case, you can do something like a t-test comparing expression between two groups of samples. If you go from 20,000 genes to 100 eigengenes, that can give you a huge boost in power (statistical power here for realsies) because you don't have so much multiple testing to correct for.

@jaclyn-taroni
Copy link
Member

I wish GitHub had ☝️ as a reaction. @jashapiro is exactly right. WGCNA is both somewhat biologically motivated if we think that genes that are coordinated in expression might participate in a coherent process and technically motivated if we want to reduce the number of variables we need to test.

@cansavvy
Copy link
Contributor Author

So for differential expression do I make a new DESeq2 object from the eigengenes and use DESeq2 for differential expression per usual? Or do I do an ANOVA with a base R function?

@jaclyn-taroni
Copy link
Member

I would probably keep it simple per Josh's suggestion.

@cansavvy
Copy link
Contributor Author

I'm probably going to open up a second Draft PR since much of this discussion has lead to quite a bit of changes and it'd probably be best to start fresh.

@cansavvy cansavvy closed this Nov 12, 2020
@cansavvy cansavvy mentioned this pull request Nov 12, 2020
10 tasks
@cansavvy cansavvy deleted the cansavvy/wgcna branch November 18, 2020 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants