Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long Computation Time for addClusters with Multiple Samples #2167

Open
limiao20 opened this issue May 27, 2024 · 3 comments
Open

Long Computation Time for addClusters with Multiple Samples #2167

limiao20 opened this issue May 27, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@limiao20
Copy link

Long Computation Time for addClusters with Multiple Samples
Dear ArchR Development Team,
I am currently using the ArchR package for analyzing my single-cell ATAC-seq data. I have encountered an issue where the addClusters step is taking an unexpectedly long time to complete when analyzing multiple samples.
Detailed information is as follows:
System Information:R Version:4.2.3,ArchR Version:1.0.2
Dataset Information:Number of Samples: 35,File Format:atac_fragments.tsv.gz.https://www.kpmp.org/available-data(Data sources)

Issue Description:
When using addClusters with method = "Seurat" and resolution = 0.8, the computation time is significantly longer than expected. For instance, processing 9 samples takes around 9 hours, while processing 35 samples has taken approximately 180 hours and is still running.
I have attached the log file for addcluster for your reference.
ArchR-addClusters-256707ea3f45e-Date-2024-05-19_Time-18-07-53.242222.log

Code Snippet
Here is the code I am using for the clustering analysis:

library(ArchR)
addArchRGenome("hg38")
addArchRThreads(threads = 10)

inputFiles <- c(#Input 35 atac_fragments.tsv.gz documents)
sampleNames <- c(#35 sample names)

ArrowFiles <- createArrowFiles(
inputFiles = inputFiles,
sampleNames = sampleNames,
filterTSS = 4,
filterFrags = 1000,
addTileMat = TRUE,
addGeneScoreMat = TRUE
)

doubScores <- addDoubletScores(
input = ArrowFiles,
k = 10, #Refers to how many cells near a "pseudo-doublet" to count.
knnMethod = "UMAP",
LSIMethod = 1
)

projHeme1 <- ArchRProject(
ArrowFiles = ArrowFiles,
outputDirectory = "HemeTutorial",
copyArrows = TRUE #This is recommened so that if you modify the Arrow files you have an original copy for later usage.
)

df <- getCellColData(projHeme1, select = c("log10(nFrags)", "TSSEnrichment"))
df
p <- ggPoint(
x = df[,1],
y = df[,2],
colorDensity = TRUE,
continuousSet = "sambaNight",
xlabel = "Log10 Unique Fragments",
ylabel = "TSS Enrichment",
xlim = c(log10(500), quantile(df[,1], probs = 0.99)),
ylim = c(0, quantile(df[,2], probs = 0.99))
) + geom_hline(yintercept = 4, lty = "dashed") + geom_vline(xintercept = 3, lty = "dashed")

p
ggsave("TSS-vs-Frags.pdf", plot = p)

p1 <- plotGroups(
ArchRProj = projHeme1,
groupBy = "Sample",
colorBy = "cellColData",
name = "TSSEnrichment",
plotAs = "ridges"
)
p1
p2 <- plotGroups(
ArchRProj = projHeme1,
groupBy = "Sample",
colorBy = "cellColData",
name = "TSSEnrichment",
plotAs = "violin",
alpha = 0.4,
addBoxPlot = TRUE
)
p2
p3 <- plotGroups(
ArchRProj = projHeme1,
groupBy = "Sample",
colorBy = "cellColData",
name = "log10(nFrags)",
plotAs = "ridges"
)
p3
p4 <- plotGroups(
ArchRProj = projHeme1,
groupBy = "Sample",
colorBy = "cellColData",
name = "log10(nFrags)",
plotAs = "violin",
alpha = 0.4,
addBoxPlot = TRUE
)
p4
library(ggplot2)
library(gridExtra)

combined_plot <- grid.arrange(p1, p2, p3, p4, ncol = 2)

ggsave("QC-Sample-Statistics.pdf", plot = combined_plot, width = 8, height = 8)

p5<- plotFragmentSizes(ArchRProj = projHeme1)
p5
p6 <- plotTSSEnrichment(ArchRProj = projHeme1)
p6
combined_plot2 <- grid.arrange(p5, p6, ncol = 1)

ggsave("QC-Sample-FragSizes-TSSProfile.pdf", plot = combined_plot2, width = 8, height = 8)
saveArchRProject(ArchRProj = projHeme1, outputDirectory = "Save-ProjHeme1", load = FALSE)

library(ArchR)
projHeme1 <- loadArchRProject(path = "Save-ProjHeme1")
print(projHeme1)

projHeme2 <- filterDoublets(projHeme1)
projHeme2
#2.1 Dimensionality Reduction with ArchR
projHeme2 <- addIterativeLSI(
ArchRProj = projHeme2,
useMatrix = "TileMatrix",
name = "IterativeLSI",
iterations = 2,
clusterParams = list( #See Seurat::FindClusters
resolution = c(0.2),
sampleCells = 10000,
n.start = 10
),
varFeatures = 25000,
dimsToUse = 1:30
)

saveRDS(projHeme2, file = "projHeme2")
projHeme2 <- readRDS("projHeme2")

#2.2harmony
projHeme2 <- addHarmony(
ArchRProj = projHeme2,
reducedDims = "IterativeLSI",
name = "Harmony",
groupBy = "Sample"
)

saveRDS(projHeme2, file = "projHeme3")

projHeme3 <- readRDS(file = "projHeme3")

#3cluster
projHeme3 <- addClusters(
input = projHeme3,
reducedDims = "IterativeLSI",
method = "Seurat",
name = "Clusters",
resolution = 0.8
)

saveArchRProject(ArchRProj = projHeme3, outputDirectory = "Save-ProjHeme3", load = FALSE)

head(projHeme3$Clusters)
table(projHeme3$Clusters)
cM <- confusionMatrix(paste0(projHeme3$Clusters), paste0(projHeme3$Sample))
cM
library(pheatmap)
cM <- cM / Matrix::rowSums(cM)
p7 <- pheatmap::pheatmap(
mat = as.matrix(cM),
color = paletteContinuous("whiteBlue"),
border_color = "black"
)
p7
ggsave("cluster-heatmap.pdf", plot = p7)

#4Single-cell Embeddings
projHeme3 <- addUMAP(
ArchRProj = projHeme3,
reducedDims = "IterativeLSI",
name = "UMAP",
nNeighbors = 30,
minDist = 0.5,
metric = "cosine"
)

p8 <- plotEmbedding(ArchRProj = projHeme3, colorBy = "cellColData", name = "Sample", embedding = "UMAP")
p9 <- plotEmbedding(ArchRProj = projHeme3, colorBy = "cellColData", name = "Clusters", embedding = "UMAP")
combined_plot3 <- grid.arrange(p8, p9, ncol = 1)

ggsave("Plot-UMAP-Sample-Clusters.pdf", plot = combined_plot3, width = 8, height = 12)

saveArchRProject(ArchRProj = projHeme3, outputDirectory = "Save-ProjHeme4", load = FALSE)

Troubleshooting Steps Taken
Verified the data format and content.
Ensured sufficient computational resources (CPU and memory).
Enabled parallel processing.
Could you please help me understand why the analysis is taking so long and suggest any potential optimizations or solutions?
Any guidance or recommendations would be greatly appreciated.

@limiao20 limiao20 added the bug Something isn't working label May 27, 2024
@rcorces
Copy link
Collaborator

rcorces commented May 27, 2024

Hi @limiao20! Thanks for using ArchR! Lately, it has been very challenging for me to keep up with maintenance of this package and all of my other
responsibilities as a PI. I have not been responding to issue posts and I have not been pushing updates to the software. We are actively searching to hire
a computational biologist to continue to develop and maintain ArchR and related tools. If you know someone who might be a good fit, please let us know!
In the meantime, your issue will likely go without a reply. Most issues with ArchR right not relate to compatibility. Try reverting to R 4.1 and Bioconductor 3.15.
Newer versions of Seurat and Matrix also are causing issues. Sorry for not being able to provide active support for this package at this time.

@ananyakohli17
Copy link

Hi!
Were you able to get past the computational time issue? I am facing a similar problem.

@ananyakohli17
Copy link

Hi! Were you able to get past the computational time issue? I am facing a similar problem.

To be more specific, I have 540178 cells across 18 samples. My clustering after running harmony and using the default parameters has been running for the last 4 days. I wanted to ask if there is a certain threshold to the number of cells for which ArchR works efficiently? Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants