Note that this repo is no longer maintained.
Major release. Current version should now work with haploid, diploid, phased, and unphased (IUPAC) outputs.
vcf2fasta.py is a Python program that extracts FASTA alignments from VCF files, given a GFF file with feature coordinates and a reference FASTA file.
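As an illustration of what "unphased (IUPAC)" output means, the sketch below collapses a single SNP genotype into one base, using an IUPAC ambiguity code for heterozygous calls. The function name `genotype_to_base` and the choice to emit `N` for missing calls are assumptions made for this example; they are not taken from vcf2fasta.py.

```python
# Hypothetical helper for illustration only; not the code used by vcf2fasta.py.
# Standard IUPAC codes for single bases and two-base ambiguities.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def genotype_to_base(gt, alleles):
    """Collapse a haploid or diploid SNP genotype into a single base.

    gt      -- GT string from a VCF sample column ('0/1', '1|1', '0', './.')
    alleles -- list with REF first, then the ALT alleles, e.g. ['A', 'G']
    """
    indices = gt.replace("|", "/").split("/")
    if "." in indices:
        return "N"  # assumption: missing calls are written as N
    bases = frozenset(alleles[int(i)] for i in indices)
    return IUPAC.get(bases, "N")  # anything outside the table becomes N
```

With REF `A` and ALT `G`, a heterozygous `0/1` call becomes `R`, while `1/1` becomes `G`.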
The reference must be indexed using:
samtools faidx ref.fa
The VCF file should be bgzip-compressed and tabix-indexed:
bgzip my_vcf_file.vcf
tabix my_vcf_file.vcf.gz
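Before running the program it can help to confirm that both index files are present. The sketch below uses only the standard library and relies on the default file names written by the commands above (`.fai` from samtools faidx, `.tbi` from tabix). The function name `missing_indexes` is hypothetical, and the check does not cover `.csi` indexes.

```python
from pathlib import Path


def missing_indexes(fasta, vcf_gz):
    """Return the expected index files that are not found on disk."""
    expected = [
        Path(str(fasta) + ".fai"),   # written by `samtools faidx`
        Path(str(vcf_gz) + ".tbi"),  # written by `tabix` with default settings
    ]
    return [str(p) for p in expected if not p.exists()]
```

An empty list means both indexes were found; otherwise the list names the files that still need to be created.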
For most GFF3 formats, no modification of the GFF file is needed if its structure follows Ensembl conventions. However, it is important to keep the whole structure of the GFF file, including complete gene features. If CDSs are the focus, they should be accompanied by their corresponding gene or parent feature:
- gene
- CDS/exon
Similarly, if the focus is introns:
- gene
- intron
Or transcripts:
- gene
- transcript
etc. Alternatively, all features can be left in the GFF file. Either way, the --feat | -e argument is always required.
If multiple transcript isoforms are in the GFF file, all of them will be fetched.
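The parent/child structure described above can be sketched as follows: features of the requested type are grouped by their `Parent` attribute, so each transcript isoform ends up with its own list of intervals. This is a minimal illustration that assumes GFF3-style `key=value` attributes; `parse_attributes` and `group_by_parent` are hypothetical names, and vcf2fasta.py may group features differently.

```python
from collections import defaultdict


def parse_attributes(field9):
    """Turn 'ID=cds1;Parent=tx1' into {'ID': 'cds1', 'Parent': 'tx1'}."""
    pairs = (item.split("=", 1) for item in field9.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def group_by_parent(gff_lines, feat):
    """Collect (seqid, start, end, strand) tuples of type `feat`, keyed by Parent.

    Falls back to the feature's own ID when it has no Parent attribute.
    """
    groups = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feat:
            continue  # keep only well-formed rows of the requested type
        attrs = parse_attributes(cols[8])
        key = attrs.get("Parent") or attrs.get("ID") or "unknown"
        groups[key].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return dict(groups)


# Made-up example rows: one gene, one transcript, two CDS segments.
example = [
    "chr1\tensembl\tgene\t100\t500\t.\t+\t.\tID=gene1",
    "chr1\tensembl\tmRNA\t100\t500\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tensembl\tCDS\t100\t200\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\tensembl\tCDS\t300\t500\t.\t+\t1\tID=cds1;Parent=tx1",
]
cds_by_transcript = group_by_parent(example, "CDS")
```

Here `cds_by_transcript` holds both CDS intervals under the key `tx1`, which is why the parent features need to stay in the file.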
pysam
art
pip3 install pysam art
Run with the -h option for more details:
usage: vcf2fasta.py [-h] --fasta GENOME --vcf VCF --gff GFF --feat FEAT
[--blend] [--inframe] [--out OUT] [--addref] [--skip]
Converts regions/intervals in the genome into FASTA alignments
provided a VCF file, a GFF file, and FASTA reference.
optional arguments:
-h, --help show this help message and exit
--fasta GENOME, -f GENOME
FASTA file with the reference genome.
--vcf VCF, -v VCF a tabix-indexed VCF file.
--gff GFF, -g GFF GFF file.
--feat FEAT, -e FEAT feature/annotation in the GFF file. (i.e. gene, CDS, intron)
--blend, -b concatenate GFF entries of FEAT into a single alignment. Useful for CDS. (default: False)
--inframe, -i force the first codon of the sequence to be inframe. Useful for incomplete CDS. (default: False)
--out OUT, -o OUT provide a name for the output directory (optional)
--addref, -r include the reference sequence in the FASTA alignment (default: False)
--skip, -s skips features without variants (default: False)
All files must be indexed. So before running the code make sure
that your reference FASTA file is indexed:
samtools faidx genome.fas
BGZIP compress and TABIX index your VCF file:
bgzip variants.vcf
tabix variants.vcf.gz
The GFF file does not need to be indexed.
examples:
python vcf2fasta.py -f genome.fas -v variants.vcf.gz -g intervals.gff -e CDS
If running on CDS, use the --blend | -b option to concatenate coding sequences. Otherwise, the script writes a separate FASTA alignment for each CDS.
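Conceptually, blending joins the per-interval alignments of one gene or transcript end to end, sample by sample. The sketch below shows that idea together with one plausible reading of --inframe, in which leading bases are dropped according to the GFF phase column. All three function names are hypothetical, strand handling (reverse-complementing minus-strand features) is left out, and the actual behaviour of vcf2fasta.py may differ.

```python
def blend(segments):
    """Concatenate per-interval alignments into a single alignment.

    segments -- list of dicts in genomic order, one per CDS interval,
                each mapping sample name -> aligned sequence for that interval
    """
    samples = list(segments[0])
    return {s: "".join(seg[s] for seg in segments) for s in samples}


def trim_to_frame(alignment, phase):
    """Drop `phase` leading columns so the first codon is complete.

    Assumption: `phase` is the GFF phase (0, 1, or 2) of the first interval.
    """
    return {s: seq[phase:] for s, seq in alignment.items()}


def to_fasta(alignment):
    """Format an alignment as FASTA text, one record per sample."""
    return "".join(f">{name}\n{seq}\n" for name, seq in alignment.items())
```

For example, blending `{"s1": "ATG"}` and `{"s1": "CCT"}` gives a single record with the sequence `ATGCCT`.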
Note: this script only works with SNPs, not indel variants. Use the Python version for consistency.
Converts a VCF file to a FASTA alignment provided a reference genome and a GFF file
Given a FASTA reference genome, a multi-sample VCF file, and a GFF file, this script generates FASTA alignments of any feature found in the GFF file, for instance coding sequences (CDS). Gene names are taken from the first ID in field 9 of the GFF. The GFF file must be sorted by position. The script accepts diploid, phased, and/or haploid data. Note that it currently handles only simple, standard VCF records: output from haplotype-based callers (HaplotypeCaller and FreeBayes), which includes gaps and multi-nucleotide variants, will not work.
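The core idea, substituting each sample's SNP alleles into the reference sequence of an interval, can be sketched in Python (the language of the recommended version, not the Perl code itself). The example assumes phased diploid genotypes such as `0|1` and skips anything that is not a single-base substitution. The function name and the shape of the `snps` records are hypothetical.

```python
def haplotypes_for_interval(ref_seq, start, snps, sample):
    """Build two haplotype sequences for one sample over one interval.

    ref_seq -- reference bases of the interval
    start   -- 1-based genomic coordinate of ref_seq[0]
    snps    -- list of (pos, ref, alts, genotypes) tuples, where genotypes
               maps sample name -> phased GT string such as '0|1'
    """
    haps = [list(ref_seq), list(ref_seq)]
    for pos, ref, alts, genotypes in snps:
        offset = pos - start
        if not 0 <= offset < len(ref_seq):
            continue  # SNP falls outside the interval
        if len(ref) != 1 or any(len(a) != 1 for a in alts):
            continue  # skip indels and multi-nucleotide variants
        alleles = [ref] + list(alts)
        for h, idx in enumerate(genotypes[sample].split("|")):
            # assumption: a missing allele ('.') is written as N
            haps[h][offset] = "N" if idx == "." else alleles[int(idx)]
    return "".join(haps[0]), "".join(haps[1])
```

Each returned string has the same length as the reference interval, so sequences from all samples stay aligned column by column.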
git clone https://github.com/santiagosnchez/vcf2fasta
cd vcf2fasta
chmod +x vcf2fasta.pl
sudo cp vcf2fasta.pl /usr/local/bin
Use the -h flag for more details:
perl vcf2fasta.pl -h
Usage:
perl vcf2fasta.pl -f <fasta-ref> -v <vcf-file> -g <gff-file> -e <gff-feature> [ --ref ] [ --phased ]
option --ref will include the reference sequence.
option --phased informs the program that the VCF is phased.
By default, neither option is set.
examples:
perl vcf2fasta.pl -f ref.fas -v snps.vcf -g annotation.gff -e CDS
perl vcf2fasta.pl -f ref.fas -v snps.phased.vcf -g annotation.gff -e CDS --phased
perl vcf2fasta.pl -f ref.fas -v snps.phased.vcf -g annotation.gff -e CDS --phased --ref