Skip to content

Pipeline to add genotyped SNPs back post-imputation (specifically, post-imputation with BEAGLE v4 or later)

Notifications You must be signed in to change notification settings

naim-panjwani/Post_imputation_pipeline

Repository files navigation

Post imputation pipeline

This pipeline aims to add genotyped SNPs that get thrown out due to their absence from the reference back into the imputed VCFs post-imputation (specifically, post-imputation with BEAGLE v4 or later). BEAGLE version 5 was used here.

Step 0

You will need the following files, programs, libraries:

  • File containing the list of SNPs to add back (can be raw)
  • File containing the list of SNPs that are imputed (check out the misc_sample_scripts/extract_imputed_SNP_names.sh sample script)
  • Will Rayner's strand alignment file for your specific platform (see https://www.well.ox.ac.uk/~wrayner/strand/)
  • The 1000 Genomes reference fasta file
  • R packages: argparse, data.table
  • bgzip, tabix, vcftools, and bcftools
  • PLINK (v1.90b6.9 64-bit (4 Mar 2019) used here)

Step 1

  • Check list of SNPs to add back to ensure that they are not already imputed
Rscript 01-cleanup_SNPs_to_add_back.R <SNP_list_to_add_back_filename> <imputed_SNP_list_filename> <output_filename>

Example:

Rscript Post_imputation_pipeline/01-cleanup_SNPs_to_add_back.R 17-Omni25_SNPs_to_add_back.txt 17-1-imputed_SNP_list.txt 17-2-Omni25_SNPs_to_add_back.txt

For the SNP list to add back, you could simply list out all SNPs in the bim file prior to the conform-gt step.
You may invoke help on the script via the -h argument:

Rscript 01-cleanup_SNPs_to_add_back.R -h
usage: 01-cleanup_SNPs_to_add_back.R [--] [--help] [--opts OPTS] SNP_list_to_add_back_filename imputed_SNP_list_filename output_filename

Cleanup tool to check whether the SNPs that we want to add back are SNPs that are not already present in the imputed files

positional arguments:
  SNP_list_to_add_back_filename			Text file with the list of SNPs we would like to add back
  imputed_SNP_list_filename			A one-column text file of all SNP ID's in the imputed files
  output_filename			Desired output filename

flags:
  -h, --help			show this help message and exit

optional arguments:
  -x, --opts OPTS			RDS file containing argument values

Step 2

  • Extract the SNPs using PLINK
plink --bfile <bplink_filename> --extract <step1_snp_list_output> --make-bed --out <output_filename>

Example:

plink --bfile 08-1-underscore_to_dash_id_update --extract 17-2-Omni25_SNPs_to_add_back.txt --make-bed --out 17-SNPs_to_add_back

Step 3

  • Align the SNPs to the plus strand Since the Illumina SNP names were converted to rsid's, we need to convert them back for strand alignment using Will Rayner's update_build.sh script
bash 03-align_strand.sh <bPLINK_filename> <rsid_illumina_mapping_filename> <strand_filename> <output_filename>

Example:

bash 03-align_strand.sh 17-SNPs_to_add_back 18-SNP_renaming_1-3_and_1-4_merged.txt strand_files_1-4/InfiniumOmni2-5-8v1-4_A1-b37.strand 20-Strand_aligned_1-4

Step 4

  • Convert to VCF; fix the X,Y,MT and pseudoautosomal X chrom field; and align the reference allele to 1000 Genomes
bash 04-convert_and_fix_vcf.sh <bPLINK_filename> <output_filename>

Example:

bash 04-convert_and_fix_vcf.sh 20-Strand_aligned_1-4 23-SNPs_to_add_back_normalized.vcf.gz

Step 5

  • Format the VCF to match the format of BEAGLE (version 5) output VCF
python3 05-VCF_Reformatting.py <VCF_filename> <output_filename>

Example:

python3 Post_imputation_pipeline/05-VCF_Reformating.py 23-SNPs_to_add_back_normalized.vcf.gz 24-SNPs_to_add_back_reformatted.vcf

Script help details:

python3 05-VCF_Reformating.py -h
usage: 05-VCF_Reformating.py [-h] vcf_filename output_filename

Tool to format a raw VCF to match the format of BEAGLE (v5) output VCF

positional arguments:
  vcf_filename     The VCF filename that needs reformatting
  output_filename  Output filename

optional arguments:
  -h, --help       show this help message and exit

Step 6

  • Sort and compress
vcf-sort <step5_outputfilename> |bgzip -c > <step5_outputfilename>.gz
tabix -p vcf <step5_outputfilename>.gz
if [ -e <step5_outputfilename>.gz ]; then rm <step5_outputfilename>; fi

Example:

vcf-sort 24-SNPs_to_add_back_reformatted.vcf |bgzip -c >24-SNPs_to_add_back_reformatted.vcf.gz
tabix -p vcf 24-SNPs_to_add_back_reformatted.vcf.gz
if [ -e 24-SNPs_to_add_back_reformatted.vcf.gz.tbi ]; then rm 24-SNPs_to_add_back_reformatted.vcf; fi

Step 7

  • Merge and sort with the imputed files. If the imputed files are split by chromosome, then an example of how to combine these follows:
for i in {1..22}; do
  (tabix -h 16-JME_Round1_and_2_chr${i}_beagle5_imputed.vcf.gz ${i}: ; tabix 24-SNPs_to_add_back_reformatted.vcf.gz ${i}: ) |vcf-sort |bgzip -c >25-JME_Round1_and_2_chr${i}_imputed_all_snps_in.vcf.gz
  tabix -p vcf 25-JME_Round1_and_2_chr${i}_imputed_all_snps_in.vcf.gz
done
i="X"
(tabix -h 16-JME_Round1_and_2_chr${i}_beagle5_imputed.vcf.gz ${i}: ; tabix 24-SNPs_to_add_back_reformatted.vcf.gz ${i}: ) |vcf-sort |bgzip -c >25-JME_Round1_and_2_chr${i}_imputed_all_snps_in.vcf.gz
25-JME_Round1_and_2_chr${i}_imputed_all_snps_in.vcf.gz
i="Y"
tabix -h 24-SNPs_to_add_back_reformatted.vcf.gz ${i}: |vcf-sort |bgzip -c >25-JME_Round1_and_2_chr${i}_imputed_all_snps_in.vcf.gz
25-JME_Round1_and_2_chr${i}_imputed_all_snps_in.vcf.gz
i="MT"
tabix -h 24-SNPs_to_add_back_reformatted.vcf.gz ${i}: |vcf-sort |bgzip -c >25-JME_Round1_and_2_chr${i}_imputed_all_snps_in.vcf.gz
25-JME_Round1_and_2_chr${i}_imputed_all_snps_in.vcf.gz

The 16-JME_Round1_and_2_chr${i}_beagle5_imputed.vcf.gz are the imputed files, and 24-SNPs_to_add_back_reformatted.vcf.gz is the file generated from step 6

About

Pipeline to add genotyped SNPs back post-imputation (specifically, post-imputation with BEAGLE v4 or later)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published