This is somewhere between a bug and a feature request, but I'm posting in Issues so that hopefully users (and future devs) see it: due to its age, ArchR has fallen behind on some of the reference data it uses.
For example I've found that certain gene symbols provided in the default geneAnnotation object (at least for hg38) are outdated. Some examples include:
MAP1A --> METAP1
CASC4 --> GOLM2
C12orf49 --> SPRING1
ACTN1-AS1 --> ACTN1-DT
Many (though not all) gene set databases (GO, Hallmark, Reactome, etc.) and enrichment tools (e.g. Enrichr) use the updated symbols, so the old symbols get dropped from analysis, which could lead to incorrect or missed results. Seurat has a function, GeneSymbolThesarus(), that attempts to find synonymous gene symbols, but it's... imperfect. It will often mistakenly "correct" a current gene symbol if it matches an old symbol for another gene, e.g.:
TEP1 (a real gene on chr14) --X--> PTEN (a real gene on chr10, formerly known as TEP1)
HPR (a real gene on chr16) --X--> MT-HPR (an uncategorized gene on chrM, formerly known as HPR)
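To illustrate how outdated symbols silently drop out of an overlap-based enrichment, here is a minimal base R sketch. The gene set and query vector are toy data made up for illustration; only the old-to-new pairs come from the examples listed above.

```r
# Hypothetical gene set that uses current symbols (toy data, not a real database entry).
gene_set <- c("GOLM2", "SPRING1", "ACTN1")

# Query genes as they might come out of an outdated annotation.
query_old <- c("CASC4", "C12orf49", "ACTN1")

# Old -> new pairs taken from the examples in this issue.
old_to_new <- c(CASC4 = "GOLM2", C12orf49 = "SPRING1")

# Replace only the symbols that have an entry in the mapping.
query_new <- query_old
hit <- query_old %in% names(old_to_new)
query_new[hit] <- unname(old_to_new[query_old[hit]])

# Overlap with the gene set before and after updating the symbols.
overlap_old <- intersect(query_old, gene_set)
overlap_new <- intersect(query_new, gene_set)
print(overlap_old)
print(overlap_new)
```

With the old symbols only one of the three query genes overlaps the set; after updating, all three do.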
I wrote this function to try to fix gene names during analysis. To avoid the problem above, it excludes any "fixed" symbols that are already in use, but it would be better to have an officially updated gene annotation.
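library(ArchR)   # getGeneAnnotation(), %ni%, and the %>% pipe
library(Seurat)  # GeneSymbolThesarus() (spelled this way in Seurat)

# Symbols currently in the project's gene annotation, with NAs dropped
old <- getGeneAnnotation(proj)$gene$symbol %>% unname
old <- old[!is.na(old)]
# Named vector: values are the updated symbols, names are the old symbols
new <- GeneSymbolThesarus(old)

fixGene <- function(genes = NULL, new.ref = new, named = FALSE) {
  # Drop "corrections" whose new symbol is already in use in the annotation
  new <- new.ref %>% subset(. %ni% getGeneAnnotation(proj)$gene$symbol)
  # Drop corrections to mitochondrial "MT-" symbols
  new <- new[!grepl("^MT-", new)]
  # Keep only the corrections that apply to the requested genes
  new <- new[names(new) %in% genes]
  if (named) { names(genes) <- genes }
  # Index into `genes` (not `old`) so that any input vector is handled correctly
  idx <- match(names(new), genes)
  genes[idx] <- new
  genes
}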
With the caveat that fixGene() might still be missing some other types of "incorrect corrections" that I haven't identified, this shows almost 800 gene symbols being reassigned:
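The filtering logic can be checked without ArchR or Seurat on a toy example. This is only a sketch: `annotated` and `proposed` are hand-written stand-ins (the latter is not real GeneSymbolThesarus() output), built from the symbol pairs mentioned earlier in this issue.

```r
# Toy annotation: TEP1, PTEN and HPR are all current symbols; CASC4 is outdated.
annotated <- c("TEP1", "PTEN", "CASC4", "HPR")

# Hand-written stand-in for a thesaurus lookup:
# names are the old symbols, values are the proposed updates.
proposed <- c(TEP1 = "PTEN", CASC4 = "GOLM2", HPR = "MT-HPR")

# Accept a proposal only if its target is not already an annotated symbol
# and is not a mitochondrial "MT-" symbol.
keep <- !(proposed %in% annotated) & !grepl("^MT-", proposed)
accepted <- proposed[keep]

# Apply the accepted corrections by position in the annotation.
fixed <- annotated
idx <- match(names(accepted), fixed)
fixed[idx] <- accepted
print(fixed)
```

Only CASC4 is rewritten here; TEP1 and HPR are left alone because their proposed replacements either collide with an existing symbol or point to an "MT-" symbol.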
To my knowledge, the curated motifs that ArchR and chromVAR use are derived from Cis-BP v1. Since motifs are such a core part of scATAC-seq analysis, it would make sense to make sure they are updated. While I'm not sure how much the PWMs themselves have changed, the set of transcription factors represented has changed dramatically, with many removed and many added, which can profoundly influence analysis and biological interpretation.
It's unclear how the original set was curated for chromVar so perhaps the Greenleaf Lab can weigh in on how to best proceed.
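For anyone who wants to quantify how the represented transcription factors differ between two motif collections, a plain set comparison is probably enough to start with. A minimal sketch follows; the TF names are placeholders, not real Cis-BP content, and in practice the two vectors would hold the TF names extracted from each collection.

```r
# Placeholder TF names standing in for the two motif collections (hypothetical data).
tf_v1 <- c("TF_A", "TF_B", "TF_C")
tf_v2 <- c("TF_B", "TF_C", "TF_D", "TF_E")

removed <- setdiff(tf_v1, tf_v2)   # present only in the older collection
added   <- setdiff(tf_v2, tf_v1)   # present only in the newer collection
shared  <- intersect(tf_v1, tf_v2) # present in both

print(list(removed = removed, added = added, shared = shared))
```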
Hi @dannyconrad! Thanks for using ArchR! Lately, it has been very challenging for me to keep up with maintenance of this package and all of my other
responsibilities as a PI. I have not been responding to issue posts and I have not been pushing updates to the software. We are actively searching to hire
a computational biologist to continue to develop and maintain ArchR and related tools. If you know someone who might be a good fit, please let us know!
In the meantime, your issue will likely go without a reply. Most issues with ArchR right now relate to compatibility. Try reverting to R 4.1 and Bioconductor 3.15.
Newer versions of Seurat and Matrix also are causing issues. Sorry for not being able to provide active support for this package at this time.
One notable group of outdated symbols is the histone genes, which underwent a large-scale renaming alongside the publication of this paper in 2022:
https://epigeneticsandchromatin.biomedcentral.com/articles/10.1186/s13072-022-00467-2
HIST1H1A --> H1-1
HIST4H4 --> H4C16
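A quick way to check whether an annotation still carries old-style histone symbols is a pattern match. This is a rough sketch: the regular expression is an assumption based only on the two examples above ("HIST", a cluster number, then "H") and may not catch every renamed histone gene.

```r
# Example symbols: two old-style histone names, their replacements, and an unrelated gene.
symbols <- c("HIST1H1A", "HIST4H4", "H1-1", "H4C16", "ACTN1")

# Assumed pattern for old-style names: "HIST", one or more digits, then "H".
old_style <- grepl("^HIST[0-9]+H", symbols)
print(symbols[old_style])
```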
Here's example output from fixGene(), where the vector contains the new symbols and the names are the old symbols:
Also, Cis-BP released an updated motif collection v2.0 back in 2019:
https://www.nature.com/articles/s41588-019-0411-1