GeneAnnotation & Cis-BP references need to be updated #2228

dannyconrad · 2024-11-01T19:26:35Z

This is somewhere between a bug and a feature request, but I'm posting it in Issues so that users (and future devs) hopefully see it: due to its age, ArchR has fallen behind on some of the reference data it uses.

For example, I've found that certain gene symbols provided in the default geneAnnotation object (at least for hg38) are outdated. Some examples include:
MAP1A --> METAP1
CASC4 --> GOLM2
C12orf49 --> SPRING1
ACTN1-AS1 --> ACTN1-DT

One notable group is the histone genes, which were renamed en masse alongside the publication of this paper in 2022:
https://epigeneticsandchromatin.biomedcentral.com/articles/10.1186/s13072-022-00467-2
HIST1H1A --> H1-1
HIST4H4 --> H4C16

Many (though not all) gene set databases (GO, Hallmark, Reactome, etc.) and enrichment tools (e.g. Enrichr) use the updated symbols, so the old symbols get dropped from analysis, which can lead to incorrect or missed results. Seurat has a function that attempts to find synonymous gene symbols, but it's... imperfect. It will often mistakenly "correct" a current gene symbol if it matches an old symbol for another gene, e.g.:
TEP1 (a real gene on chr14) --X--> PTEN (a real gene on chr10, formerly known as TEP1)
HPR (a real gene on chr16) --X--> MT-HPR (an uncategorized gene on chrM, formerly known as HPR)

I wrote this function to fix gene names during analysis. It excludes any "fixed" symbol that is already in use, which avoids this problem, but it would be better to have an officially updated gene annotation.

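library(ArchR)   # getGeneAnnotation(), %ni%, and the magrittr pipe
library(Seurat)  # GeneSymbolThesarus() (spelled this way in Seurat)

# Symbols currently in the project's gene annotation
old <- getGeneAnnotation(proj)$gene$symbol %>% unname
old <- old[!is.na(old)]
# Named vector of suggested updates: names are old symbols, values are new ones
new <- GeneSymbolThesarus(old)

fixGene <- function(genes = NULL, new.ref = new, named = FALSE) {
  # Drop suggestions that collide with a symbol already in the annotation
  new <- new.ref %>% subset(. %ni% getGeneAnnotation(proj)$gene$symbol)
  # Drop suggestions that point to mitochondrial "MT-" symbols
  new <- new %>% subset(. %ni% grep("^MT-", ., value = TRUE))
  new <- new[names(new) %in% genes]
  if (named) { names(genes) <- genes }
  # Index into the input vector itself so subsets of `old` also work
  idx <- match(names(new), genes)
  genes[idx] <- new
  genes
}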

With the caveat that fixGene() might still be missing some other types of "incorrect corrections" that I haven't identified, this shows almost 800 gene symbols being reassigned:

> table(fixGene(old, new) == old)
FALSE  TRUE 
  793 24168

Here's example output where the vector contains the new symbols and the names are the old symbols:

> fixGene(old, new, named = T) %>% {subset(., names(.) != .)} %>% head
   LINC00982     TP73-AS1    LINC00337     C1orf158     C1orf195     FLJ37453 
 "PRDM16-DT"     "GFOD3P"    "ICMT-DT"    "CFAP107" "TMEM51-AS2"   "SPEN-AS1"
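
For anyone who wants to see the collision guard in isolation, here is a minimal base-R sketch that needs neither ArchR nor Seurat. The `annotated` and `suggested` vectors are hand-written stand-ins (assumptions for illustration, built from the examples earlier in this post) for what getGeneAnnotation() and GeneSymbolThesarus() would return; it is not the exact logic of fixGene().

```r
# Toy stand-in for the symbols in a gene annotation (illustrative only).
annotated <- c("TEP1", "PTEN", "HIST1H1A", "CASC4")

# Hypothetical thesaurus output: names are old symbols, values are suggestions.
suggested <- c(TEP1 = "PTEN", HIST1H1A = "H1-1", CASC4 = "GOLM2")

# Keep a suggestion only if the proposed symbol is not already annotated,
# so a current symbol such as TEP1 is never rewritten to PTEN.
safe <- suggested[!(suggested %in% annotated)]

# Apply the surviving suggestions to the annotated symbols.
updated <- annotated
hit <- updated %in% names(safe)
updated[hit] <- unname(safe[updated[hit]])

print(updated)
```

In this toy case TEP1 is left alone because PTEN is already an annotated symbol, while HIST1H1A and CASC4 are updated.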

Also, Cis-BP released an updated motif collection v2.0 back in 2019:
https://www.nature.com/articles/s41588-019-0411-1

To my knowledge, the curated motifs that ArchR and chromVAR use are derived from v1. Since these are such a core part of scATAC-seq analysis, it would make sense to update them. While I'm not sure how much the PWMs themselves have changed, the set of transcription factors represented has changed dramatically, with many removed and many added, which can profoundly influence analysis and biological interpretation.

It's unclear how the original set was curated for chromVAR, so perhaps the Greenleaf Lab can weigh in on how best to proceed.
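
One rough way to quantify the change in TF coverage is to compare the TF names attached to each motif collection. The sketch below uses placeholder names (`TF_A`, `TF_B`, ...) that are purely hypothetical; in practice the two vectors would have to be extracted from the v1-derived and v2 motif sets, and how to do that depends on how each collection labels its motifs.

```r
# Placeholder TF names only; real names would come from the two motif collections.
tfs_v1 <- c("TF_A", "TF_B", "TF_C", "TF_D")
tfs_v2 <- c("TF_B", "TF_C", "TF_E", "TF_F", "TF_G")

removed <- setdiff(tfs_v1, tfs_v2)    # present in v1 only
added   <- setdiff(tfs_v2, tfs_v1)    # present in v2 only
shared  <- intersect(tfs_v1, tfs_v2)  # present in both

print(c(removed = length(removed), added = length(added), shared = length(shared)))
```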


rcorces · 2024-11-01T19:26:45Z

Hi @dannyconrad! Thanks for using ArchR! Lately, it has been very challenging for me to keep up with maintenance of this package and all of my other
responsibilities as a PI. I have not been responding to issue posts and I have not been pushing updates to the software. We are actively searching to hire
a computational biologist to continue to develop and maintain ArchR and related tools. If you know someone who might be a good fit, please let us know!
In the meantime, your issue will likely go without a reply. Most issues with ArchR right now relate to compatibility. Try reverting to R 4.1 and Bioconductor 3.15.
Newer versions of Seurat and Matrix also are causing issues. Sorry for not being able to provide active support for this package at this time.

dannyconrad added the "bug" label on Nov 1, 2024
