This repository provides a systematic framework for mapping gene annotations between species, with a focus on mapping human (GENCODE v47) and mouse (GENCODE vM25) annotations to various target species including primates and rodents.
Cross-species genomics requires high-quality genome assemblies and gene annotations. While genome assemblies are increasingly available through efforts like the Vertebrate Genomes Project, gene annotations often lag behind. This repository provides:
- Automated downloading of source and target genome assemblies
- Systematic mapping of GENCODE annotations using Liftoff
- Creation of genome packages for downstream analysis
Located in config/source_genomes.tsv:
Genome | Version | Source | Annotation |
---|---|---|---|
Human | GRCh38/hg38 | UCSC | GENCODE v47 comprehensive |
Human | GRCh38/hg38 | UCSC | GENCODE v47 basic |
Mouse | GRCm38/mm10 | UCSC | GENCODE vM25 |
Located in config/target_genomes.tsv:
- Rhesus macaque
  - rheMac8 (UCSC download)
  - rheMac10 (UCSC download)
- Crab-eating macaque
  - macFas6 (NCBI download)
- Marmoset
  - mCalJac1 (Genomeark download)
  - calJac4 (UCSC download)
- Pig-tailed macaque
  - mMacNem1 (Genomeark download)
- Norway rat
  - rn6 (UCSC download)
  - rn7 (UCSC download)
- Pig
  - susScr11 (UCSC download)
Processed data is organized under output/genomes/ with the following structure:
output/genomes/
├── {genome}/                 # e.g., rheMac10/
│   ├── {genome}.fa           # Genome FASTA
│   └── annotations/
│       └── {target_genome}-{source_genome}-{annotation_version}.gtf.gz
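The naming convention above can be turned into a small shell helper for locating a lifted annotation. This is a minimal sketch: the function name `annotation_path` is hypothetical and is not one of the repository's scripts; it only assembles a path from the layout shown.

```shell
# Hypothetical helper (not part of this repository): build the expected
# path of a lifted annotation from the naming convention shown above.
# Arguments: $1 = target genome, $2 = source genome, $3 = annotation version
annotation_path() {
    printf 'output/genomes/%s/annotations/%s-%s-%s.gtf.gz\n' \
        "$1" "$1" "$2" "$3"
}

annotation_path rheMac10 hg38 gencode.v47.basic
# prints: output/genomes/rheMac10/annotations/rheMac10-hg38-gencode.v47.basic.gtf.gz
```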
The following table shows all currently available lifted gene annotations:
Target Genome | Human (hg38) | Mouse (mm10) |
---|---|---|
calJac4 | gencode.v44.basic, gencode.v47.basic, gencode.v47.comp | - |
macFas6 | gencode.v44.basic, gencode.v47.basic, gencode.v47.comp | - |
mCalJac1 | gencode.v44.basic, gencode.v47.basic, gencode.v47.comp | - |
mMacNem1 | gencode.v44.basic, gencode.v47.basic, gencode.v47.comp | - |
rheMac8 | gencode.v44.basic, gencode.v47.basic, gencode.v47.comp | - |
rheMac10 | gencode.v44.basic, gencode.v47.basic, gencode.v47.comp | - |
rn6 | - | gencode.vM25.basic, gencode.vM25.comp |
rn7 | - | gencode.vM25.basic, gencode.vM25.comp |
susScr11 | gencode.v44.basic, gencode.v47.basic, gencode.v47.comp | - |
Each annotation is available as a gzipped GTF file in the respective genome's annotations directory. For example:
rheMac10/
├── rheMac10.fa
└── annotations/
    ├── rheMac10-hg38-gencode.v44.basic.gtf.gz
    ├── rheMac10-hg38-gencode.v47.basic.gtf.gz
    └── rheMac10-hg38-gencode.v47.comp.gtf.gz
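As a quick sanity check on a lifted annotation, one option is to count how many `gene` records it contains. The sketch below assumes the lifted file keeps GENCODE-style `gene` records in the standard tab-separated GTF layout (feature type in column 3, comment lines starting with `#`). The function name `count_gene_features` is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not part of this repository): count "gene" records
# in a gzipped GTF. Column 3 of a GTF line holds the feature type.
count_gene_features() {
    gzip -dc "$1" | awk -F '\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }'
}

# Example path follows the directory layout shown above.
gtf="output/genomes/rheMac10/annotations/rheMac10-hg38-gencode.v47.basic.gtf.gz"
if [ -f "$gtf" ]; then
    count_gene_features "$gtf"
else
    echo "Annotation not found: $gtf" >&2
fi
```

Comparing this count against the same count on the source GENCODE GTF gives a rough idea of how many genes were mapped to the target genome.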
- Install this repository and its dependencies using the provided conda environment:
git clone git@github.com:pfenninglab/custom_ArchR_genomes_and_annotations.git
cd custom_ArchR_genomes_and_annotations
conda env create -f config/conda_environment.yml
- Download source/target genomes:
./scripts/download-genome.sh -g rheMac10 \
-f https://hgdownload.soe.ucsc.edu/goldenPath/rheMac10/bigZips/rheMac10.fa.gz \
-n gencode.v47.basic \
-t https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.basic.annotation.gtf.gz
- Run Liftoff gene mapping:
./scripts/liftoff-genes.sh -s hg38 -t rheMac10 -a gencode.v47.basic
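To map one annotation onto several target genomes, the Liftoff step can be wrapped in a loop. The sketch below only prints the commands (a dry run) so they can be reviewed first. The `-s`, `-t`, and `-a` flags are taken from the example above; the function name `liftoff_commands` is hypothetical, and the sketch assumes each target genome has already been downloaded.

```shell
# Hypothetical helper (not part of this repository): print one
# liftoff-genes.sh command per target genome instead of running it.
# Arguments: $1 = source genome, $2 = annotation version, rest = target genomes
liftoff_commands() {
    local source_genome="$1" annotation="$2" target
    shift 2
    for target in "$@"; do
        printf './scripts/liftoff-genes.sh -s %s -t %s -a %s\n' \
            "$source_genome" "$target" "$annotation"
    done
}

liftoff_commands hg38 gencode.v47.basic rheMac8 rheMac10 macFas6
```

Once the printed commands look right, piping the output to `sh` from the repository root would run them one after another.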
If you use these resources, please cite:
- Liftoff: Shumate A, Salzberg SL (2021) Bioinformatics
- GENCODE: Frankish A, et al. (2021) Nucleic Acids Research
- Original gene annotations: Please cite the data resource DOI:
Phan, BaDoi; Pfenning, Andreas (2022): Alternate gene annotations for rat, macaque, and marmoset for single cell RNA and ATAC analyses.
Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/21176401.v1
Issues and pull requests welcome! See CONTRIBUTING.md for guidelines.