WIP: WGCNA module draft #350

cansavvy · 2020-11-11T16:19:20Z

Analysis Purpose

It's a spin off issue from #306 and the discussion on #346.

Ultimately I was trying to see if we could squeak in a quick WGCNA run for ORA in one module, but as you can see here, it's not really realistic for the. WGCNA run to be that short particularly since we also need to do some DESeq2 transformations before we can get to WGCNA.

Pull Request Stage

This is a Draft PR - needs review of big concepts and outline

Strategy

I've been trying to sort through what seemed like the important points of the WGCNA tutorials and:

Make them tidyverse and ggplot2
Put more explanation about what things mean for the user

Also made sure we are following the RNA-seq guidance that's written in the FAQ of WGCNA.

Concerns/Questions for reviewers:

In the "Explore WGCNA" results section you will see I'm doing a number of things, making multiple plots. I in no way expect we keep all those things, but instead I want to know which things we find most useful and we can delete the others and I'll expand a tad more on the useful things. Here's a summary of the things I'm proposing (some of which are inspired by plots I see in WGCNA tutorials):
- Show users how to save the which gene is which module key (this seems necessary/useful)
- A metadata labeled, sample x eigengene heatmap
- A module eigengene x module eigengene heatmap
You'll see I briefly mention/explain TOMtype parameter. Do we think this is unnecessary/out of scope for our users?
I've been looking at the tutorials and papers and forums and I can't for the life of me find an intuitive explanation for what "power" for soft-threshold network construction actually is (not info on how to use it, but like what does it mean?? What does WGCNA do with it). So if anyone has insight on that, I couldn't figure it out.
I have set the seed in the main WGCNA function blockwiseModules() but it sounds like a global set the seed might work instead? I suppose global is preferred or no?

Analysis Pull Request Check List (roughly in order):

Content checks

All {{BLANKS}} have been replaced with the correct content.
Sources are cited
Seed is set (if applicable)

Formatting Checks

Removed any manual numbering of sections.
Removed any instances of chunk naming.
Comments and documentation are up to date.
All links have been checked and are properly formatted.

Add datasets to S3

Added data and metadata files to S3.

Docker/Snakemake rendering components

Added the .html link to the navigation bar.
Any not yet added packages needed for this analysis have been added to the Dockerfile and it successfully builds.

cansavvy · 2020-11-11T17:12:37Z

network-analysis_rnaseq_01_wgcna.html.zip

Here's the rendered html if you wanna see what the draft plots look like without checking out the branch.

jashapiro · 2020-11-11T19:59:27Z

I'll start with the elephant, which is that I share your concerns about this being a bit of a bear (mixed animal metaphors for the win). I'm not 100% sure how I feel about that on the whole. It does seem like something that might end up in "advanced" rather than the main set of analyses.

I am also not sure whether you are intending to keep this described as ORA or if that is just leftovers from the template. I see it as quite distinct from a standard ORA, and I would not refer to it as such. Maybe you are planning to use these modules as the sets for ORA, but that would seem rather circular to me.

Skipping to the end, I feel like we would want some kind of rough conclusion for users to get out of this analysis. Is there a way we can reduce the number of modules that are returned to get the ones we might care most about? What do we do with these modules once we find them? I think the relationship of the eigengenes to the metadata labels gets closest to answering this question, so it seems like the best place to focus effort.

To make this easier to interpret, we might try to come up with ways to reduce the number of Eigengenes that we get out. That might involve adjusting and examining some parameters and thresholds, which I imagine is work that might not make it into an example notebook, but should be done. The thing we need to think about is how to provide a balance between a single example that looks pretty good and the fact that parameters and results should be examined carefully.

Answering your question about power: I think based on my reading that the Power is the exponent to which they raise the correlations to before thresholding & dividing the correlation network into modules. This separates the really high correlations from the just kind of high correlations. At least that is my read. (0.99^14 vs 0.98^14 is a much larger difference than 0.99 vs. 0.98, and I imagine that helps with the thresholding steps). So yeah, the first definition of power you learned, not statistical power, which is what I would have assumed based on no context (except that the values make no sense as statistical power).

A couple of specific notes:

power = 14 may be too low for this example, as it makes the threshold just below 0.8, not above it. Not that this seems like it should be a hard rule. This and other parameters probably require tweaking, which is too much to cover here, but we should present an example that looks "good". If you raise the power, do you get more or fewer eigengenes?
It would probably make sense to use numeric module labels rather than the color names, especially as you are not plotting those colors at all. (Also, I think "grey" means no module as the equivalent to module 0, which is worth noting).
Handy tidyverse function: tibble::enframe()
You can replace this:

gene_modules_df <- data.frame(gene_modules) %>%
  tibble::rownames_to_column("Gene")

with:

gene_modules_df <- tibble::enframe(gene_modules, name = "gene", value = "module")

To me that is more clear what is going on, and does it all in one neat step.
Though really, for this use case, just table(gene_modules) would be sufficient. I don't think we need to tidy all the things.

cansavvy · 2020-11-11T21:20:40Z

@jashapiro, This was my attempt to "simplify" the WGCNA tutorials, but I think you are correct that it is still too much for a non-advanced topics user to go through. I tried to give just enough parameter guidance as to not be totally cavalier with the analysis but I think if someone were to really get into this the right way, they should dive into the parameters more than the user of this section may be willing or able to do. So... I think moving this to advanced-topics makes sense.

To clear up the confusion about is this ORA or not:
Although this is not supposed to be ORA anymore, the idea was wee would take one of the gene modules from this analysis to do ORA -- in its own separate example notebook #344. (The ORA comments in the notebook leftover are remnants from the older version of this draft which was ORA but I wanted to get big picture thoughts before doing anymore polishing)

So, this means as far as timeline, we need to discuss the further work on this since the priority for this sprint is getting the pathway analyses and this was a tangent that was supposed to feed into an ORA, pathway analysis example. That being said it doesn't make sense to have an advanced topics example feed into a "main" example, so we will have to reevaluate where we are getting a gene set from for the ORA example #344.

In other notes:

Thanks for your interpretation on power. I had a suspicion it related to exponents but couldn't figure out for sure what it meant.
I like your other suggestions and will implement them after we figure out "big picture" where this example is going.

jashapiro · 2020-11-11T21:27:56Z

To clear up the confusion about is this ORA or not:
Although this is not supposed to be ORA anymore, the idea was wee would take one of the gene modules from this analysis to do ORA -- in its own separate example notebook #344. (The ORA comments in the notebook leftover are remnants from the older version of this draft which was ORA but I wanted to get big picture thoughts before doing anymore polishing)

Is this a thing that people do? Because as I said, it seems kind of circular. I can see taking modules from one data set and using them for ORA in another data set, but if you are using modules defined in the same data set, I would assume that all the genes will be significant or none of them will be, as the modules are defined by the correlations between genes...

jaclyn-taroni · 2020-11-11T21:31:45Z

The question I was after is "Does this coexpression module significantly overlap with known pathways?" So given a module that is associated with metadata, for example, are p53 pathway genes overrepresented?

That being said it doesn't make sense to have an advanced topics example feed into a "main" example, so we will have to reevaluate where we are getting a gene set from for the ORA example

I'm not outright opposed to having an advanced example feed into a "main" example. We're not sure people will ever look at more than one example and our point for getting on this tangent was basically that any list of genes of interest will work.

cansavvy · 2020-11-11T21:37:54Z

I'm not outright opposed to having an advanced example feed into a "main" example. We're not sure people will ever look at more than one example and our point for getting on this tangent was basically that any list of genes of interest will work.

I suppose that's fair. We usually link to the notebook where the data come from though, in case the user wants to produce the data themselves, so I guess we'd have to ask if it's okay to link a user to a more advanced topic or will that be overwhelming. It's obviously unclear to us at this stage how users will be navigating these kinds of examples with predecessors so perhaps it's not a problem.

jashapiro · 2020-11-11T21:58:23Z

The question I was after is "Does this coexpression module significantly overlap with known pathways?" So given a module that is associated with metadata, for example, are p53 pathway genes overrepresented?

Ah, that makes more sense...

I am also not opposed to using an advanced example to feed back in, but it does seems a bit of a slippery slope. It could add confusion as to the "why" for a particular analysis.

cansavvy · 2020-11-11T22:03:06Z

I am also not opposed to using an advanced example to feed back in, but it does seems a bit of a slippery slope. It could add confusion as to the "why" for a particular analysis.

I think the plan is to move forward as is, but when we add a more simple method of getting gene cluster example, we can replace it in ORA. See here: #344 (comment)

cansavvy · 2020-11-11T22:50:03Z

So it seems we probably want to end up with a gene list from a module that is most related to a metadata variable of interest. This seems most pertinent to what users would want to do as well as our downstream ORA application. This is somewhat related to my rough draft of a heatmap of eigengenes with metadata labels, but should probably be a little more methodical than that.

One of the WGCNA tutorials suggests doing this kind of analysis through more correlations and a plot like this:

If we want to do this type of analysis, this means we'd actually have to have a meaningful numeric variable or more which means this dataset won't work and I'll have to do some digging to find a dataset that would work. (So meanwhile I'm not going to worry about playing with power if we switch datasets completely).

However, I'm not crazy about more correlations, so if people have other suggestions of how to identify modules of most interest, I'm open ears. I'll do some digging and see if there's other legit ideas out there.

jaclyn-taroni · 2020-11-11T22:58:21Z

You can test the module eigengenes for differential expression if you have a categorical variable that's of interest.

cansavvy · 2020-11-11T23:06:34Z

You can test the module eigengenes for differential expression if you have a categorical variable that's of interest.

I was thinking about that but didn't know if that'd be considered a "valid" method.

jashapiro · 2020-11-12T00:19:06Z

I was thinking about that but didn't know if that'd be considered a "valid" method.

It absolutely would be! Think about the eigengenes as a smaller set of genes that you can use to reduce multiple testing burdens. SO you can treat them as "gene" with measured expression, and then ask if the expression is different between the groups you care about.

In the example you showed above, they are comparing lots of different continuous traits to eigengene "expression", and plotting all the correlation coefficients, but the much simpler case is when you have only one trait and you want to see which of the eigengenes is correlated with it. In the very simplest case, you can do something like a t-test comparing expression between two groups of samples. If you go from 20,000 genes to 100 eigengenes, that can give you a huge boost in power (statistical power here for realsies) because you don't have so much multiple testing to correct for.

jaclyn-taroni · 2020-11-12T13:47:20Z

I wish GitHub had ☝️ as a reaction. @jashapiro is exactly right. WGCNA is both somewhat biologically motivated if we think that genes that are coordinated in expression might participate in a coherent process and technically motivated if we want to reduce the number of variables we need to test.

cansavvy · 2020-11-12T14:17:17Z

So for differential expression do I make a new DESeq2 object from the eigengenes and use DESeq2 for differential expression per usual? Or do I do an ANOVA with a base R function?

jaclyn-taroni · 2020-11-12T14:19:54Z

I would probably keep it simple per Josh's suggestion.

cansavvy · 2020-11-12T15:55:18Z

I'm probably going to open up a second Draft PR since much of this discussion has lead to quite a bit of changes and it'd probably be best to start fresh.

… same results

cansavvy added 7 commits November 10, 2020 09:54

Push a basic draft that doesn't exactly work

aa88556

More words and polishing

c029600

Try out some plots

9ac8121

Put some DRAFT and REVIEW designations

51bd54d

Add to Dockerfile

743fb45

Get rid of umap edit

cff76f6

Add wgcna example to snakefile and run

4f77862

cansavvy requested a review from jashapiro November 11, 2020 18:33

Move WGCNA to the "advanced topics" folder

364e175

remove rna-seq files

58b4550

cansavvy added 2 commits November 12, 2020 11:20

A set up with ANOVA

ab3df5b

Use limma instead. Its more straightforward and gives essentially the…

d16639a

… same results

cansavvy closed this Nov 12, 2020

cansavvy mentioned this pull request Nov 12, 2020

WIP: WGCNA second draft #353

Closed

10 tasks

cansavvy deleted the cansavvy/wgcna branch November 18, 2020 16:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: WGCNA module draft #350

WIP: WGCNA module draft #350

cansavvy commented Nov 11, 2020 •

edited

Loading

cansavvy commented Nov 11, 2020

jashapiro commented Nov 11, 2020

cansavvy commented Nov 11, 2020

jashapiro commented Nov 11, 2020

jaclyn-taroni commented Nov 11, 2020

cansavvy commented Nov 11, 2020

jashapiro commented Nov 11, 2020

cansavvy commented Nov 11, 2020

cansavvy commented Nov 11, 2020

jaclyn-taroni commented Nov 11, 2020

cansavvy commented Nov 11, 2020

jashapiro commented Nov 12, 2020

jaclyn-taroni commented Nov 12, 2020

cansavvy commented Nov 12, 2020

jaclyn-taroni commented Nov 12, 2020

cansavvy commented Nov 12, 2020

WIP: WGCNA module draft #350

WIP: WGCNA module draft #350

Conversation

cansavvy commented Nov 11, 2020 • edited Loading

Analysis Purpose

Pull Request Stage

Strategy

Concerns/Questions for reviewers:

Analysis Pull Request Check List (roughly in order):

Content checks

Formatting Checks

Add datasets to S3

Docker/Snakemake rendering components

cansavvy commented Nov 11, 2020

jashapiro commented Nov 11, 2020

cansavvy commented Nov 11, 2020

jashapiro commented Nov 11, 2020

jaclyn-taroni commented Nov 11, 2020

cansavvy commented Nov 11, 2020

jashapiro commented Nov 11, 2020

cansavvy commented Nov 11, 2020

cansavvy commented Nov 11, 2020

jaclyn-taroni commented Nov 11, 2020

cansavvy commented Nov 11, 2020

jashapiro commented Nov 12, 2020

jaclyn-taroni commented Nov 12, 2020

cansavvy commented Nov 12, 2020

jaclyn-taroni commented Nov 12, 2020

cansavvy commented Nov 12, 2020

cansavvy commented Nov 11, 2020 •

edited

Loading