scregclust takes too long time #5

Open
Pang-Ka opened this issue Nov 21, 2024 · 4 comments

Comments

@Pang-Ka

Pang-Ka commented Nov 21, 2024

Dear authors,
I ran scregclust on a dataset with 20541 features and 19840 samples using the following code:
fit <- scregclust(z, genesymbols, is_regulator, penalization = seq(0.1, 0.5, 0.05), n_modules = 10L, n_cycles = 50L, noise_threshold = 0.05)

It took about 18 hours to complete a single cycle for a single penalization value. Is there any way to speed this up?
Many thanks!


@cyianor
Collaborator

cyianor commented Nov 21, 2024

Hi,
How many tentative regulators are you considering, i.e. what is sum(is_regulator)?

Also, what kind of computer are you running this on?
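For reference, a minimal sketch of the check being asked about. The matrix and the regulator flags below are made-up toy values standing in for the real inputs (only the names `z`, `genesymbols` and `is_regulator` come from this thread), and the sketch assumes genes are in rows and cells in columns:

```r
# Toy stand-ins for the real inputs (hypothetical values, genes in rows)
set.seed(1)
z <- matrix(rnorm(6 * 4), nrow = 6,
            dimnames = list(paste0("gene", 1:6), paste0("cell", 1:4)))
genesymbols <- rownames(z)
is_regulator <- c(1, 0, 0, 1, 0, 0)

# Number of candidate regulators and of remaining (target) genes
n_regulators <- sum(is_regulator)
n_targets <- sum(is_regulator == 0)

# All per-gene vectors should have one entry per row of z
stopifnot(length(genesymbols) == nrow(z),
          length(is_regulator) == nrow(z))

cat("genes:", nrow(z), "cells:", ncol(z),
    "regulators:", n_regulators, "targets:", n_targets, "\n")
```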

@Pang-Ka
Author

Pang-Ka commented Nov 25, 2024

Hi,
I found that sum(is_regulator) is 1. Is that right?
My code is as follows:
z <- GetAssayData(sc, layer = "scale.data")
dim(z) #[1] 20527 19840
out <- scregclust_format(z, mode = "TF")
is_regulator <- out$is_regulator

I ran the code on my MacBook Pro (Apple M1 Pro, macOS 14.6.1).

@idacharlottalarsson
Collaborator

Hi,

Adding to Felix's answer above, I would strongly recommend that you include a feature selection step before running scregclust, e.g. only including the top XX most variable genes in your gene expression matrix. For single-cell data generated on a UMI-based platform, e.g. 10X Genomics, I usually normalize my data using sctransform, which by default returns the 3000 most variable genes (this can of course be changed to return more or fewer genes). Including all 20000+ genes is usually unnecessary, since most of them have zero variance and are therefore uninformative (as you can already see in your first run, more than 16000 genes were placed in the noise cluster). Hopefully this will speed things up.
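As a rough illustration of the idea, here is a minimal base-R sketch that keeps only the most variable genes and subsets the per-gene vectors with the same index so everything stays aligned. It uses plain row variance on simulated data rather than sctransform, and the names `z_full`, `is_regulator_full` and `n_top` as well as the regulator flags are hypothetical. On real data the ranking would have to be done on normalized but unscaled values, since scaled rows all have roughly the same variance.

```r
# Simulated genes x cells matrix with gene-specific spread (hypothetical data)
set.seed(42)
n_genes <- 200
n_cells <- 50
gene_sd <- runif(n_genes, min = 0, max = 2)
z_full <- matrix(rnorm(n_genes * n_cells, sd = rep(gene_sd, times = n_cells)),
                 nrow = n_genes)
rownames(z_full) <- paste0("gene", seq_len(n_genes))

# Hypothetical regulator flags: every 10th gene is a candidate regulator
is_regulator_full <- as.integer(seq_len(n_genes) %% 10 == 0)

# Rank genes by variance across cells and keep the n_top most variable ones,
# preserving their original row order
n_top <- 30
gene_var <- apply(z_full, 1, var)
keep <- sort(order(gene_var, decreasing = TRUE)[seq_len(n_top)])

# Subset the matrix and every per-gene vector with the same index
z <- z_full[keep, , drop = FALSE]
genesymbols <- rownames(z)
is_regulator <- is_regulator_full[keep]

# The three inputs must describe the same genes in the same order
stopifnot(nrow(z) == length(genesymbols),
          nrow(z) == length(is_regulator))
cat("kept", nrow(z), "genes, of which", sum(is_regulator),
    "are candidate regulators\n")
```

The reduced `z`, `genesymbols` and `is_regulator` would then be passed to scregclust in place of the full-size inputs. Note that variance filtering can drop candidate regulators, so it is worth checking sum(is_regulator) again afterwards.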

Let us know how it goes!

@Pang-Ka
Author

Pang-Ka commented Nov 26, 2024

Thank you for your suggestions. I tried running scregclust with the top 2000 variable genes, but I still got an error:
out <- scregclust_format(z, mode = "TF")
genesymbols <- VariableFeatures(sc)
sample_assignment <- out$sample_assignment
is_regulator <- out$is_regulator
mouseTF <- read.table('/Volumes/Data/allTFs_mm.txt')
ix <- which(genesymbols %in% mouseTF$V1)
is_regulator[ix] <- 1
fit <- scregclust(z, genesymbols, is_regulator, penalization = seq(0.1, 0.5, 0.05), n_modules = 10L, n_cycles = 50L, noise_threshold = 0.05)

(Screenshot of the error message.)
