GeneAnnotation & Cis-BP references need to be updated #2228

Open
dannyconrad opened this issue Nov 1, 2024 · 1 comment
Labels
bug Something isn't working

dannyconrad commented Nov 1, 2024

This is somewhere between a bug and a feature request, but I'm posting it in Issues so that users (and future devs) will hopefully see it: due to its age, ArchR has fallen behind on some of the reference data it uses.

For example, I've found that certain gene symbols in the default geneAnnotation object (at least for hg38) are outdated. Some examples include:
MAP1A --> METAP1
CASC4 --> GOLM2
C12orf49 --> SPRING1
ACTN1-AS1 --> ACTN1-DT

One notable group is the histone genes, which underwent a big renaming along with the publication of this paper in 2022:
https://epigeneticsandchromatin.biomedcentral.com/articles/10.1186/s13072-022-00467-2
HIST1H1A --> H1-1
HIST4H4 --> H4C16

Many (though not all) gene set databases (GO, Hallmark, Reactome, etc.) and enrichment tools (e.g. EnrichR) use the updated symbols, so the old symbols get dropped from analysis, which can lead to incorrect or missed results. Seurat has a function that attempts to find synonymous gene symbols, but it's... imperfect. It will often mistakenly "correct" a current gene symbol if it matches an old symbol for another gene, e.g.:
TEP1 (a real gene on chr14) --X--> PTEN (a real gene on chr10, formerly known as TEP1)
HPR (a real gene on chr16) --X--> MT-HPR (an uncategorized gene on chrM, formerly known as HPR)

I wrote this function to try to fix gene names during analysis. To avoid the problem above, it excludes "fixed" symbols that are already in use, but it would be better to have an officially updated gene annotation.

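library(ArchR)   # getGeneAnnotation() and %ni%
library(Seurat)  # GeneSymbolThesarus()

# Symbols from the project's gene annotation, without NAs
old <- unname(getGeneAnnotation(proj)$gene$symbol)
old <- old[!is.na(old)]
# Named vector: names are old symbols, values are the suggested replacements
new <- GeneSymbolThesarus(old)

fixGene <- function(genes = NULL, new.ref = new, named = FALSE) {
  # Drop suggestions that are already in use as symbols in the annotation
  new <- new.ref[new.ref %ni% getGeneAnnotation(proj)$gene$symbol]
  # Drop suggestions that point to mitochondrial (MT-) symbols
  new <- new[!grepl("^MT-", new)]
  # Keep only suggestions for the genes that were passed in
  new <- new[names(new) %in% genes]
  if (named) { names(genes) <- genes }
  # Replace each outdated symbol at its position in the input vector
  idx <- match(names(new), genes)
  genes[idx] <- new
  genes
}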
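
The filtering idea behind fixGene() can be checked on toy data without ArchR or Seurat. The sketch below is only an illustration: `current_symbols` and `alias_map` are made-up stand-ins (for the annotation's symbols and for the old-to-new lookup, respectively), filled in with examples from this post.

```r
# Toy stand-in for the annotation's symbols (hypothetical, not pulled from ArchR).
current_symbols <- c("TEP1", "PTEN", "HPR", "CASC4", "HIST1H1A")

# Toy lookup (hypothetical): names = old symbol, values = suggested symbol.
alias_map <- c(TEP1 = "PTEN", HPR = "MT-HPR", CASC4 = "GOLM2", HIST1H1A = "H1-1")

# A suggestion is unsafe if the suggested symbol is already in the annotation,
# because the "old" symbol is then probably a distinct, real gene.
collides <- alias_map %in% current_symbols

# Same extra rule as in the post: skip suggestions with an MT- prefix.
is_mito <- grepl("^MT-", alias_map)

safe_map <- alias_map[!collides & !is_mito]

# Apply only the safe renames; every other symbol is left untouched.
updated <- current_symbols
hit <- match(names(safe_map), updated)
updated[hit] <- safe_map
print(updated)
```

Here TEP1 and HPR are left alone, while CASC4 and HIST1H1A are renamed. A filter like this only catches collisions with symbols present in the annotation, so other kinds of wrong suggestions can still slip through.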

With the caveat that fixGene() might still be missing some other types of "incorrect corrections" that I haven't identified, this shows almost 800 gene symbols being reassigned:

> table(fixGene(old, new) == old)
FALSE  TRUE 
  793 24168 

Here's example output where the vector contains the new symbols and the names are the old symbols:

> fixGene(old, new, named = T) %>% {subset(., names(.) != .)} %>% head
   LINC00982     TP73-AS1    LINC00337     C1orf158     C1orf195     FLJ37453 
 "PRDM16-DT"     "GFOD3P"    "ICMT-DT"    "CFAP107" "TMEM51-AS2"   "SPEN-AS1" 

Also, Cis-BP released an updated motif collection v2.0 back in 2019:
https://www.nature.com/articles/s41588-019-0411-1

To my knowledge, the curated motifs that ArchR & chromVAR use are derived from v1. Since these are such a core part of scATAC-seq analysis, it would make sense to make sure they are updated. While I'm not sure how much the PWMs themselves have changed, the set of transcription factors represented has changed dramatically, with many being removed and many being added, which can profoundly influence analysis and biological interpretation:
image
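
A change like this can be quantified by comparing the TF names in the two motif collections with set operations. A minimal sketch, assuming the TF names have already been extracted into two character vectors; `tfs_v1`, `tfs_v2`, and the placeholder names in them are hypothetical and are not actual Cis-BP contents:

```r
# Placeholder TF names (hypothetical); in practice these would come from
# the names of the two motif collections being compared.
tfs_v1 <- c("TF_A", "TF_B", "TF_C", "TF_D")
tfs_v2 <- c("TF_B", "TF_C", "TF_E", "TF_F", "TF_G")

removed <- setdiff(tfs_v1, tfs_v2)    # present only in the older set
added   <- setdiff(tfs_v2, tfs_v1)    # present only in the newer set
shared  <- intersect(tfs_v1, tfs_v2)  # present in both sets

tf_change <- c(removed = length(removed),
               shared  = length(shared),
               added   = length(added))
print(tf_change)
```

Matching on names alone assumes both collections label TFs with the same symbols, so outdated gene symbols would need to be reconciled before the comparison.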

It's unclear how the original set was curated for chromVAR, so perhaps the Greenleaf Lab can weigh in on how best to proceed.

dannyconrad added the bug label Nov 1, 2024
Collaborator

rcorces commented Nov 1, 2024

Hi @dannyconrad! Thanks for using ArchR! Lately, it has been very challenging for me to keep up with maintenance of this package and all of my other responsibilities as a PI. I have not been responding to issue posts and I have not been pushing updates to the software. We are actively searching to hire a computational biologist to continue to develop and maintain ArchR and related tools. If you know someone who might be a good fit, please let us know! In the meantime, your issue will likely go without a reply. Most issues with ArchR right now relate to compatibility. Try reverting to R 4.1 and Bioconductor 3.15. Newer versions of Seurat and Matrix are also causing issues. Sorry for not being able to provide active support for this package at this time.
