Tentative implementation for adjusting of confounding factors using edgeR #2539

robinpaul85 · 2024-12-20T19:50:27Z

Note: Please add features.run_parametricDE in serverconfig.json for testing this feature.

Tentative implementation for adjusting of confounding factors using edgeR

Caveats:

Currently only one confounding factor can be adjusted for. In the future, adjustment for multiple confounding factors will be allowed.
Adjusting for confounding factors is even slower than normal edgeR (which is already significantly slower than the Wilcoxon method). For example, with group 1 sample size = 259 and group 2 sample size = 539, it takes about 3 min without adjustment for confounding factors and about 15 min with adjustment.
The current implementation seems to work only for simple confounding factors for now. I will also work on this.

Checklist

Check each task that has been performed or verified to be not applicable.

Tests: added and/or passed unit and integration tests, or N/A
Todos: commented or documented, or N/A
Notable Changes: updated release.txt, prefixed a commit message with "fix:" or "feat:", added to an internal tracking document, or N/A

karishma-gangwani · 2024-12-30T23:47:58Z

Hi @robinpaul85 thanks for this pr.

1. When running covariate adjustment, we should only run it with edgeR and not with Wilcoxon (a non-parametric method). Can you make this change in the UI so we don't end up seeing type I errors? For Wilcoxon, just normalizing for batch effects should be enough.
2. In the UI, why is the default fc now 2 again? I believe we had made that 0 in the previous PR. Can you check this and fix it?
3. Switching fc to 0, then the method to edgeR, and selecting 'Molecular subtypes' as the covariate doesn't load anything for me. I don't see any client- or server-side errors, using the same example of 259 vs 539 samples for sensitive vs resistant.

I will continue to test.

robinpaul85 · 2024-12-31T04:22:03Z

Can you test the branch without using a sessions file? I believe then default fc will be 0.

karishma-gangwani · 2024-12-31T18:54:10Z

Can you test the branch without using a sessions file? I believe then default fc will be 0.

Okay, that seems to work, so 2. and 3. are fine. Can you look into 1?
Also, another thing we are not currently handling is groups created with overlapping samples. We have no way to run the DE analysis after excluding the overlapping samples from the two groups. We should think of a way to do that and enable the analysis for the non-overlapping samples. I can make a note of it for another PR; let me know. For the overlapping samples, we might have to enable some other kinds of analyses.
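To make the idea concrete, here is a minimal sketch of splitting two groups into their non-overlapping parts plus the shared samples. The function name `excludeOverlap`, the `SplitGroups` shape, and the use of integer sample ids are assumptions for illustration only; nothing like this exists in the PR, and where such a step would live is an open question.

```typescript
// Hypothetical helper: names and shapes are assumptions, not code from this PR.
type SampleId = number

interface SplitGroups {
  group1: SampleId[] // samples found only in group 1
  group2: SampleId[] // samples found only in group 2
  overlap: SampleId[] // samples present in both groups
}

function excludeOverlap(group1: SampleId[], group2: SampleId[]): SplitGroups {
  const set1 = new Set(group1)
  const set2 = new Set(group2)
  return {
    group1: group1.filter(s => !set2.has(s)),
    group2: group2.filter(s => !set1.has(s)),
    overlap: group1.filter(s => set2.has(s))
  }
}

// Example: samples 3 and 4 were assigned to both groups.
const split = excludeOverlap([1, 2, 3, 4], [3, 4, 5])
console.log(split)
```

The DE analysis could then run on `group1` and `group2`, while `overlap` is kept aside for whatever other analysis is chosen for those samples.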

robinpaul85 · 2025-01-01T04:03:10Z

I think that creation of DE groups has nothing to do with the DE app itself. The DE app only requires two groups. If you want to remove overlap it would have to be handled by some upstream code.

karishma-gangwani · 2025-01-02T15:37:32Z

I think that creation of DE groups has nothing to do with the DE app itself. The DE app only requires two groups. If you want to remove overlap it would have to be handled by some upstream code.

Yes, that is fine. This is not important for now; we can discuss the overlapping samples later. But you should make the fix for the adjustment: when a covariate is selected, the Wilcoxon button should be hidden (for now), since this adjustment is only applicable to edgeR, until we have enabled some batch correction or other adjustment models suitable for Wilcoxon.
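As a rough illustration of the requested UI rule, the sketch below derives which methods to offer from the selected covariates. The type `DEMethod` and the functions `availableMethods` and `resolveMethod` are hypothetical names, not identifiers from the client code, and the actual change in this PR may be wired differently.

```typescript
// Hypothetical sketch; the names here are assumptions for illustration only.
type DEMethod = 'edgeR' | 'wilcoxon'

// Covariate adjustment applies only to edgeR, so Wilcoxon is hidden
// as soon as any covariate is selected.
function availableMethods(selectedCovariates: string[]): DEMethod[] {
  return selectedCovariates.length > 0 ? ['edgeR'] : ['edgeR', 'wilcoxon']
}

// If the current method is no longer offered, fall back to edgeR.
function resolveMethod(current: DEMethod, selectedCovariates: string[]): DEMethod {
  const allowed = availableMethods(selectedCovariates)
  return allowed.indexOf(current) !== -1 ? current : 'edgeR'
}

console.log(availableMethods([])) // both methods offered
console.log(availableMethods(['Molecular subtypes'])) // edgeR only
console.log(resolveMethod('wilcoxon', ['Molecular subtypes']))
```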

karishma-gangwani · 2025-01-02T16:17:39Z

also, it takes a significant amount of time to run the DE analysis after adding the variable. Is there a way to speed up the process? It shows 'Loading...' but looks like it is frozen.

robinpaul85 · 2025-01-02T18:20:25Z

also, it takes a significant amount of time to run the DE analysis after adding the variable. Is there a way to speed up the process? It shows 'Loading...' but looks like it is frozen.

Not that I can think of at the moment. That is why I had suggested precomputing earlier. I can try benchmarking with DESeq2 to see how long it takes.

karishma-gangwani · 2025-01-02T18:37:51Z

also, it takes a significant amount of time to run the DE analysis after adding the variable. Is there a way to speed up the process? It shows 'Loading...' but looks like it is frozen.

Not that I can think of at the moment. That is why I had suggested precomputing earlier. I can try benchmarking with DESeq2 to see how long it takes.

You could run DESeq2 and see (better in a separate branch). And yes, precomputing may be a better idea if it's going to take this long to run on the fly. We can discuss it with Xin. Can you look into batch correction for Wilcoxon? We can test that out in this PR as well.

server/utils/edge.R

karishma-gangwani · 2025-01-14T15:38:19Z

server/utils/edge.R


+        # Get the column numbers (in the HDF5 file) corresponding to the sample ID for case cohort
+        parse_sample_indices_time <- system.time({
        samples_indicies <- c()
        for (sample in cases) {


can optimize these for loops.

The HDF5 file requires sample indices, not sample names. So, I think the current implementation makes sense.

That is fine. I am saying that instead of for loops you could use match to create your indices and then check against them. Something like this; can you see if it is possible?

    case_indices <- match(cases, samples)
    control_indices <- match(controls, samples)
    if (any(is.na(case_indices))) {
        missing_cases <- cases[is.na(case_indices)]
        print(paste(missing_cases, "not found"))
        quit(status = 1)
    }

karishma-gangwani · 2025-01-14T15:38:29Z

server/utils/edge.R

            } else {
                print (paste(sample,"not found"))
                quit(status = 1)
            }
        }

+        # Get the column numbers (in the HDF5 file) corresponding to the sample ID for control cohort
        for (sample in controls) {


this one too.

My reply is the same too as the previous comment.

karishma-gangwani · 2025-01-14T16:17:31Z

server/utils/edge.R

-#data %>% select(all_of(combined))
-#read_file_time_start <- Sys.time()
+case_sample_list <- c()
+control_sample_list <- c()
 if (exists(input$storage_type)==FALSE) {


This check with exists(input$storage_type) seems redundant. Can we get rid of it, since this is defined in your JSON input?

Also, the last else block (around lines 109-110) seems redundant and can be removed. Instead, add quit(status = 1) where you print 'Unknown storage type' at line 106; the remaining else block below it is then not needed, since it is repetitive.

karishma-gangwani · 2025-01-14T16:59:46Z

server/routes/termdb.DE.ts

-                    values[] // using integer sample id
+samplelst{}
+    groups[]
+            values[] // using integer sample id


line 66 comment is repeated.

This has now been removed

karishma-gangwani · 2025-01-14T17:01:06Z

server/routes/termdb.DE.ts

@@ -57,12 +72,16 @@ param{}
 	const group1names = [] as string[]


Lines 72-105 can be consolidated into a single function that handles both the group1 and group2 names.

I think we can do this in the following branch, when I implement DESeq2.
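One possible shape for that consolidation is sketched below. The interfaces `GroupValue` and `SampleGroup`, the `id2name` lookup, and the function `getGroupSampleNames` are all assumptions made for illustration; the real request and sample-lookup structures in termdb.DE.ts may differ, and only the variable names `group1names` and `group2names` echo the route code.

```typescript
// Hypothetical types: the real shapes used in termdb.DE.ts may differ.
interface GroupValue {
  sampleId: number // integer sample id, as noted in the route comment
}
interface SampleGroup {
  name: string
  values: GroupValue[]
}

// Resolve one group's integer sample ids to sample names,
// skipping ids that have no known name.
function getGroupSampleNames(group: SampleGroup, id2name: Map<number, string>): string[] {
  const names: string[] = []
  for (const v of group.values) {
    const name = id2name.get(v.sampleId)
    if (name !== undefined) names.push(name)
  }
  return names
}

// The same helper serves both groups instead of two near-identical loops.
const id2name = new Map<number, string>([
  [1, 'S1'],
  [2, 'S2'],
  [3, 'S3']
])
const group1names = getGroupSampleNames({ name: 'sensitive', values: [{ sampleId: 1 }, { sampleId: 9 }] }, id2name)
const group2names = getGroupSampleNames({ name: 'resistant', values: [{ sampleId: 2 }, { sampleId: 3 }] }, id2name)
console.log(group1names, group2names)
```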

karishma-gangwani

Line 177 in termdb.DE.ts: the logic for running the default Wilcoxon can be combined with the previous block, and the last else statement removed. You can add the sample-size cutoff similar to the implementation where edgeR is selected as the parameter in the first if block. Basically, this part can be cleaned up and consolidated.

robinpaul85 · 2025-01-14T17:54:07Z

Line 177 in termdb.DE.ts: the logic for running the default Wilcoxon can be combined with the previous block, and the last else statement removed. You can add the sample-size cutoff similar to the implementation where edgeR is selected as the parameter in the first if block. Basically, this part can be cleaned up and consolidated.

This has now been removed.

karishma-gangwani

Here are my results. It looks good to me now. We changed the HDF5 file to protein-coding genes only and made optimizations to the code.

including non-coding genes:

no confounders:

Time taken to run edgeR: 152778 ms
line: Time to read JSON:  0.001  seconds
line: Time to read counts data:  29.145  seconds
line: Time to generate DGEList:  0.191  seconds
line: Time to filter by expression:  0.243  seconds
line: Normalization time:  14.02  seconds
line: Dispersion time:  78.597  seconds
line: Exact test time:  29.216  seconds
line: Time for multiple testing correction:  0.038  seconds


with 1 confounder (Molecular subtype):

Time taken to run edgeR: 762764 ms
line: Time to read JSON:  0.002  seconds
line: Time to read counts data:  27.745  seconds
line: Time to generate DGEList:  0.177  seconds
line: Time to filter by expression:  0.259  seconds
line: Normalization time:  14.539  seconds
line: Time for making design matrix:  0.002  seconds
line: Dispersion time:  606.712  seconds
line: Fit time:  76.012  seconds
line: Test statistics time:  35.846  seconds
line: Time for multiple testing correction:  0.03  seconds

only with protein coding genes:

no confounders:

line: Time to read JSON:  0.002  seconds
line: Time to read counts data:  6.949  seconds
line: Time to generate DGEList:  0.061  seconds
line: Time to filter by expression:  0.103  seconds
line: Normalization time:  7.351  seconds
line: Dispersion time:  46.281  seconds
line: Exact test time:  15.847  seconds
line: Time for multiple testing correction:  0.014  seconds

with 1 confounder (Molecular subtype):

line: Time to read JSON:  0.002  seconds
line: Time to read counts data:  6.92  seconds
line: Time to generate DGEList:  0.061  seconds
line: Time to filter by expression:  0.108  seconds
line: Normalization time:  7.637  seconds
line: Time for making design matrix:  0.002  seconds
line: Dispersion time:  344.394  seconds
line: Fit time:  43.227  seconds
line: Test statistics time:  20.326  seconds
line: Time for multiple testing correction:  0.012  seconds
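To compare the two protein-coding runs at a glance, the per-stage timings above can be summed. The snippet below is only a convenience sketch: the numbers are copied from the logs above, while the variable names and the helper `totalSeconds` are made up for this example.

```typescript
// Stage timings in seconds, copied from the protein-coding runs above.
type Stage = [label: string, seconds: number]

const noConfounder: Stage[] = [
  ['read JSON', 0.002],
  ['read counts data', 6.949],
  ['generate DGEList', 0.061],
  ['filter by expression', 0.103],
  ['normalization', 7.351],
  ['dispersion', 46.281],
  ['exact test', 15.847],
  ['multiple testing correction', 0.014]
]

const withConfounder: Stage[] = [
  ['read JSON', 0.002],
  ['read counts data', 6.92],
  ['generate DGEList', 0.061],
  ['filter by expression', 0.108],
  ['normalization', 7.637],
  ['design matrix', 0.002],
  ['dispersion', 344.394],
  ['fit', 43.227],
  ['test statistics', 20.326],
  ['multiple testing correction', 0.012]
]

// Sum the seconds over all stages of one run.
function totalSeconds(stages: Stage[]): number {
  return stages.reduce((sum, [, seconds]) => sum + seconds, 0)
}

const totalNoConfounder = totalSeconds(noConfounder)
const totalWithConfounder = totalSeconds(withConfounder)
console.log(totalNoConfounder.toFixed(3), totalWithConfounder.toFixed(3))
```

Summed this way, the logged stages come to roughly 77 s without a confounder and roughly 423 s with one, and dispersion estimation accounts for most of the adjusted run.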

xzhou82

thanks

robinpaul85 requested review from xzhou82, creilly8 and karishma-gangwani December 20, 2024 19:50

robinpaul85 marked this pull request as draft December 20, 2024 19:50

robinpaul85 force-pushed the DE_conf branch from 8ae53c8 to 14fbfb6 Compare December 20, 2024 22:22

robinpaul85 force-pushed the DE_conf branch from 14fbfb6 to 034291c Compare January 6, 2025 15:55

karishma-gangwani reviewed Jan 7, 2025

robinpaul85 force-pushed the DE_conf branch 6 times, most recently from 895afc9 to cd5d055 Compare January 14, 2025 05:11

karishma-gangwani reviewed Jan 14, 2025

robinpaul85 force-pushed the DE_conf branch from cd5d055 to dc22a64 Compare January 14, 2025 17:47

robinpaul85 marked this pull request as ready for review January 14, 2025 21:06

karishma-gangwani self-requested a review January 14, 2025 21:07

karishma-gangwani approved these changes Jan 14, 2025

robinpaul85 and others added 18 commits January 14, 2025 15:14

Aded termdb to client side

d4b73c1

Added server side

80e950d

Parsing sample ID confounding data

efdeef5

Passed confounding factor to edgeR

6cb814a

Added design matric in edgeR

db8b17d

Tentative implementation of adjustement of confounding factors using …

c7f9266

…edgeR

Adding replace/remove option for confounding factor

89b5d29

Small change in types

f4f4525

Small change in edgeR script

b5b6a36

Printing case and control lists from edgeR script

77aefb8

Show confounding factors only when edgeR is selected as the method

9cc7395

Using glmfit and glmLRT in edgeR

c49df21

Small change in model.matrix

89cf9ab

Fixed order of confounding factors

9833f30

Changed y

29a7abc

Added comments and benchamrking to edgeR script

88f7797

Removed duplicate else statement for wilcoxon test

e4e7e29

Improved readability of edgeR code

6fb765f

robinpaul85 force-pushed the DE_conf branch from e0405b8 to 6fb765f Compare January 14, 2025 21:15

xzhou82 approved these changes Jan 15, 2025

xzhou82 merged commit acfd4c5 into master Jan 15, 2025
3 checks passed

xzhou82 deleted the DE_conf branch January 15, 2025 18:22
