Skip to content

an intelligent genotype imputation reference recommendation method with convolutional neural networks

License

Notifications You must be signed in to change notification settings

shishuo16/RefRGim

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RefRGim

an intelligent genotype imputation reference reconstruction method with convolutional neural networks based on genetic similarity of individuals from input data and current references. RefRGim has been pretrained with single nucleotide polymorphism data of individuals in 1000 Genomes Project, which are from 26 different populations across the world. A population was delimited as a haplotype group.

Background

Genotype imputation is a statistical method for estimating missing genotypes that are not directly assayed or sequenced in study from a denser haplotype reference panel.Existing methods usually performed well on imputing common frequency variants, but not ideal for rare variants, which typically play important roles in many complex human diseases and phenotype studies. Previous studies showed the population similarity between study and reference panel is one of the key features influencing the imputation performance.

Installation

Requirements

Install

Note: we have pre-trained our model with all variants data in the 1000 Genomes project and generated 251 convolutional neural networks (CNNs) from 22 autosomes. Considering from the aspect of saving download time and computer memory, you can only choose one chr file in 1KGP_CNN_net to download. Or if you do not care about a little more downloading time and computer memory, you can download RefRGim using default method:

### Download
git clone --recursive https://github.com/shishuo16/RefRGim.git

Download and process reference panels of the 1000 Genomes Project

mkdir raw_1KGP
cd raw_1KGP
### Download
sh ../scripts/downloadfile.sh
### Process
sh ../scripts/RefRGim_process.ref.sh

Running

RefRGim takes compressed study vcf file (VCFv4.2 format), RefRGim path, path of raw reference panels of the 1000 Genomes Project, and a prefix for output files as inputs and output the study specified reference panels, most genetic-similar haplotype group for each input individuals, the genetic-similar probability matrix of haplotype groups and input individuals, retrained convolutional neural network, and convolutional neural network retraining process.

Demo

./RefRGim example/test.vcf.gz ./ raw_1KGP example/test.out

Outputs

  • test.out.SuperPopulation/chr*.vcf.gz
    • study specified reference panels. Haplotypes whose population belongs to a same super population were merged into one vcf file
  • test.out.populations
    • study individual classification result: individual name, haplotype group, and super haplotype group
  • test.out.population.probs
    • probability matrix of input individuals and 26 haplotype group
  • test.out_net
    • directory that saves the retrained weights and parameters for the model
  • test.out_training_info
    • directory that saves graph of weights, biases, and loss function in retraining process, which can be display using tensorboard:
    tensorboard --logdir=test.out_training_info
    

Maintainers

@shishuo16

Citations

Shuo Shi, Qiheng Qian, Shuhuan Yu, Qi Wang, Jinyue Wang, Jingyao Zeng, Zhenglin Du, Jingfa Xiao, RefRGim: an intelligent reference panel reconstruction method for genotype imputation with convolutional neural networks, Briefings in Bioinformatics, Volume 22, Issue 6, November 2021, bbab326, https://doi.org/10.1093/bib/bbab326

About

an intelligent genotype imputation reference recommendation method with convolutional neural networks

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published