-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tentative implementation for adjusting of confounding factors using edgeR #2539
Conversation
Hi @robinpaul85 thanks for this pr.
I will continue to test. |
Can you test the branch without using a sessions file? I believe then default fc will be 0.
Dec 30, 2024 5:48:22 PM karishma gangwani ***@***.***>:
…
Hi @robinpaul85[https://github.com/robinpaul85] thanks for this pr.
1. when running covariate adjustment we should only run it on edgeR and not wilcoxon (non-parametric method). Can you make this change on the UI so we don't end up seeing type I errors? For wilcoxon just normalizing for batch effects should be enough.
2. on the UI why is the default fc now 2 again? I believe we had made that 0 in the previous PR. Can you check this and fix?
3. When switching fc to 0 and then method to edgeR and selecting 'Molecular subtypes' as the covariate doesn't load anything for me. I don't see any client or server-side errors, using the same example of 259 vs 539 samples for sensitive vs resitant.
I will continue to test.
—
Reply to this email directly, view it on GitHub[#2539 (comment)], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AKCZFMQ2CNPBBO5PUR5FYXD2IHLULAVCNFSM6AAAAABT7XV5LKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRWGAYTGOJUGA].
You are receiving this because you were mentioned.
[Tracking image][https://github.com/notifications/beacon/AKCZFMREQOWB6H5PFO4LU2T2IHLULA5CNFSM6AAAAABT7XV5LKWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTUY6JB7I.gif]
|
okay, that seems to work. so 2. and 3. are fine. can you look into 1? |
I think that creation of DE groups has nothing to do with the DE app itself. The DE app only requires two groups. If you want to remove overlap it would have to be handled by some upstream code.
Dec 31, 2024 12:54:33 PM karishma gangwani ***@***.***>:
…
Can you test the branch without using a sessions file? I believe then default fc will be 0. Dec 30, 2024 5:48:22 PM karishma gangwani /*@*/.***>:
…[#]
Hi @robinpaul85[https://github.com/robinpaul85][https://github.com/robinpaul85] thanks for this pr. 1. when running covariate adjustment we should only run it on edgeR and not wilcoxon (non-parametric method). Can you make this change on the UI so we don't end up seeing type I errors? For wilcoxon just normalizing for batch effects should be enough. 2. on the UI why is the default fc now 2 again? I believe we had made that 0 in the previous PR. Can you check this and fix? 3. When switching fc to 0 and then method to edgeR and selecting 'Molecular subtypes' as the covariate doesn't load anything for me. I don't see any client or server-side errors, using the same example of 259 vs 539 samples for sensitive vs resitant. I will continue to test. — Reply to this email directly, view it on GitHub[#2539 (comment)[#2539 (comment)]], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AKCZFMQ2CNPBBO5PUR5FYXD2IHLULAVCNFSM6AAAAABT7XV5LKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRWGAYTGOJUGA]. You are receiving this because you were mentioned. [Tracking image][https://github.com/notifications/beacon/AKCZFMREQOWB6H5PFO4LU2T2IHLULA5CNFSM6AAAAABT7XV5LKWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTUY6JB7I.gif]
okay, that seems to work. so 2. and 3. are fine. can you look into 1?
also, another thing we are not handling currently is when groups are created with overlapping samples. We don't have any way to run the DE analysis after excluding out the overlapping samples from the two groups. We should think of a way to do that and enable the analysis for the non-overlapping. I can make a note of it for another pr. let me know. because for those overlapping samples we might have to enable some other kinds of analyses.
—
Reply to this email directly, view it on GitHub[#2539 (comment)], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AKCZFMR4CQVUVEO6ZZTYXYT2ILR6PAVCNFSM6AAAAABT7XV5LKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRWGY2TSMZXHE].
You are receiving this because you were mentioned.
[Tracking image][https://github.com/notifications/beacon/AKCZFMUUQ7C2DNRDAEASOGL2ILR6PA5CNFSM6AAAAABT7XV5LKWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTUY7QOTG.gif]
|
yes, that is fine. this is not important for now. We can discuss the overlapping samples later. but you should make the fix for running the adjustment for edgeR when a covariate is selected wilcoxon button should hide (for now) since this adjustment is only applicable to edgeR. until we have enabled some batch correction or other adjustment models suitable for wilcoxon. |
also, it takes a significant amount of time to run the DE analysis after adding the variable. Is there a way to speed up the process? It shows 'Loading...' but looks like it is frozen. |
Not that I can think of at the moment. That is why I had suggested precomputing earlier. I can try benchmarking with DESeq2 to see how fast it takes. |
You could run DESeq2 and see (better in a separate branch). And Yes, maybe precomputing will be a better idea if it's going to take this long to run it on-the-fly. We can discuss with Xin once. Can you look into the batch correction for Wilcoxon? We can test that out as well in this pr. |
895afc9
to
cd5d055
Compare
server/utils/edge.R
Outdated
|
||
# Get the column numbers (in the HDF5 file) corresponding to the sample ID for case cohort | ||
parse_sample_indices_time <- system.time({ | ||
samples_indicies <- c() | ||
for (sample in cases) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can optimize these for loops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HDF5 files requires sample indices not sample names. So, I think the current implementation makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fine. I am saying instead of using for loops you could use match
instead to create your indices and then check against that. something like this. see if this is possible?
case_indices <- match(cases, samples)
control_indices <- match(controls, samples)
if (any(is.na(case_indices))) {
missing_cases <- cases[is.na(case_indices)]
print(paste(missing_cases, "not found"))
quit(status = 1)
}
server/utils/edge.R
Outdated
} else { | ||
print (paste(sample,"not found")) | ||
quit(status = 1) | ||
} | ||
} | ||
|
||
# Get the column numbers (in the HDF5 file) corresponding to the sample ID for control cohort | ||
for (sample in controls) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reply is the same too as the previous comment.
server/utils/edge.R
Outdated
#data %>% select(all_of(combined)) | ||
#read_file_time_start <- Sys.time() | ||
case_sample_list <- c() | ||
control_sample_list <- c() | ||
if (exists(input$storage_type)==FALSE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check with exists(input$storage_type) seems redundant. can get rid of it. since this is defined in your JSON input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and also can get rid of the last else block. seems redundant. around line 109-110. can add quit(status = 1)
where you are printing 'Unknown storage type' at line 106 instead and the remaining else block below it is not needed. it is repetitive.
values[] // using integer sample id | ||
samplelst{} | ||
groups[] | ||
values[] // using integer sample id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 66 comment is repeated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has now been removed
@@ -57,12 +72,16 @@ param{} | |||
const group1names = [] as string[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 72-105 can be consolidated in a single function for group1 and group2 names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do this in the following branch, when I implement DESeq2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 177 in termdb.DE.ts, can just combine the logic for running default wilcoxon with the previous block and get rid of the last else statement. you can add the cutoff for sample size similar to the implementation where edgeR is selected as parameter in the first if block. basically can clean this part and consolidate it.
This has now been removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here are my results. It looks good to me now. We changed the hdf5 file to protein coding only and made optimizations to the code.
including non-coding genes:
no confounders:
Time taken to run edgeR: 152778 ms
line: Time to read JSON: 0.001 seconds
line: Time to read counts data: 29.145 seconds
line: Time to generate DGEList: 0.191 seconds
line: Time to filter by expression: 0.243 seconds
line: Normalization time: 14.02 seconds
line: Dispersion time: 78.597 seconds
line: Exact test time: 29.216 seconds
line: Time for multiple testing correction: 0.038 seconds
with 1 confounder (Molecular subtype):
Time taken to run edgeR: 762764 ms
line: Time to read JSON: 0.002 seconds
line: Time to read counts data: 27.745 seconds
line: Time to generate DGEList: 0.177 seconds
line: Time to filter by expression: 0.259 seconds
line: Normalization time: 14.539 seconds
line: Time for making design matrix: 0.002 seconds
line: Dispersion time: 606.712 seconds
line: Fit time: 76.012 seconds
line: Test statistics time: 35.846 seconds
line: Time for multiple testing correction: 0.03 seconds
only with protein coding genes:
no confounders:
line: Time to read JSON: 0.002 seconds
line: Time to read counts data: 6.949 seconds
line: Time to generate DGEList: 0.061 seconds
line: Time to filter by expression: 0.103 seconds
line: Normalization time: 7.351 seconds
line: Dispersion time: 46.281 seconds
line: Exact test time: 15.847 seconds
line: Time for multiple testing correction: 0.014 seconds
with 1 confounder (Molecular subtype):
line: Time to read JSON: 0.002 seconds
line: Time to read counts data: 6.92 seconds
line: Time to generate DGEList: 0.061 seconds
line: Time to filter by expression: 0.108 seconds
line: Normalization time: 7.637 seconds
line: Time for making design matrix: 0.002 seconds
line: Dispersion time: 344.394 seconds
line: Fit time: 43.227 seconds
line: Test statistics time: 20.326 seconds
line: Time for multiple testing correction: 0.012 seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks
Note: Please add
features.run_parametricDE
inserverconfig.json
for testing this feature.Tentative implementation for adjusting of confounding factors using edgeR
Caveats:
Currently only support for adjustment of 1 confounding factors. In the future, will allow for adjustment of multiple confounding factors.
Implementation of adjustment of confounding factors is even slower than normal edgeR (which is already significantly slower than the wilcoxon method). For e.g . group 1 sample size = 259, group2 sample size = 539 it takes about ~3 mins (without adjustment of confounding factors) and ~15mins WITH adjustment of confounding factors.
Current implementation seem to only work for simple confounding factors for now. Will also work on this.
Checklist
Check each task that has been performed or verified to be not applicable.