Converts a VCF file to a FASTA alignment provided a reference genome and a GFF file

santiagosnchez/vcf2fasta

vcf2fasta

Note that this repo is no longer maintained.

Major release. Current version should now work with haploid, diploid, phased, and unphased (IUPAC) outputs.

vcf2fasta.py is a Python program that extracts FASTA alignments from VCF files, given a GFF file with feature coordinates and a reference FASTA file.
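
To illustrate what an unphased (IUPAC) output means, here is a minimal sketch of collapsing one diploid SNP genotype into a single character. It is not taken from vcf2fasta.py; the function name `collapse_genotype` and the fallback to `N` are assumptions made for the example.

```python
# Illustrative only: this is not the code used by vcf2fasta.py.
# Standard IUPAC ambiguity codes for two-base combinations.
IUPAC = {
    frozenset("AC"): "M",
    frozenset("AG"): "R",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
}


def collapse_genotype(alleles):
    """Collapse the single-base alleles of one genotype into one character.

    Homozygous calls return the base itself; heterozygous calls return the
    matching IUPAC code. Anything not in the table falls back to "N"
    (an assumption of this sketch).
    """
    bases = frozenset(a.upper() for a in alleles)
    if len(bases) == 1:
        return next(iter(bases))
    return IUPAC.get(bases, "N")
```

For example, a heterozygous A/G call becomes `R`, while a homozygous C/C call stays `C`. With phased data the two alleles would instead go into separate haplotype sequences.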

Preprocessing

The reference must be indexed using:

samtools faidx ref.fa

The VCF file must be bgzip-compressed and tabix-indexed:

bgzip my_vcf_file.vcf
tabix my_vcf_file.vcf.gz
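
Before running the program it can help to confirm that both index files exist. The sketch below assumes the default index names written by the commands above (`.fai` next to the reference, `.tbi` next to the compressed VCF); the helper name `missing_indexes` is hypothetical.

```python
from pathlib import Path


def missing_indexes(fasta, vcf_gz):
    """Return the expected index files that are not present on disk."""
    expected = [
        Path(f"{fasta}.fai"),   # written by `samtools faidx`
        Path(f"{vcf_gz}.tbi"),  # written by `tabix` with default settings
    ]
    return [str(path) for path in expected if not path.exists()]
```

An empty list means both indexes were found, for example after calling `missing_indexes("ref.fa", "my_vcf_file.vcf.gz")`.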

For most GFF3 files, no modification is needed if the structure follows the Ensembl convention. However, it is important to keep the whole structure of the GFF file, including complete gene features. If CDSs are the focus, they should be accompanied by their corresponding gene or parent feature:

  • gene
  • CDS/exon

Similarly, if the focus is introns:

  • gene
  • intron

Or transcripts:

  • gene
  • transcript

And so on. Alternatively, all features can be left in the GFF file. In either case, the --feat | -e argument is always required.
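
If a reduced GFF is preferred, the pairing described above (a parent feature plus the feature of interest) can be produced with a small filter. This is a sketch under assumptions: the function name `filter_gff_lines` is hypothetical, and the parent type is taken to be `gene`, which may differ in some annotations.

```python
def filter_gff_lines(lines, feature, parent="gene"):
    """Yield comment lines plus records whose type (column 3) is kept.

    `feature` is the type passed to --feat (e.g. "CDS"); `parent` is the
    accompanying parent type (assumed here to be "gene").
    """
    keep = {feature, parent}
    for line in lines:
        if line.startswith("#"):
            yield line  # keep headers and directives unchanged
            continue
        cols = line.rstrip("\n").split("\t")
        # A GFF record has nine tab-separated columns; the type is the third.
        if len(cols) >= 9 and cols[2] in keep:
            yield line
```

It can be applied to an open file, for example `out.writelines(filter_gff_lines(open("intervals.gff"), "CDS"))`.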

If the GFF file contains multiple transcript isoforms, all of them will be fetched.

Requirements

  • pysam
  • art

pip3 install pysam art
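
A quick way to confirm that both requirements are importable in the current environment is sketched below. The helper name `missing_modules` is hypothetical; it only checks that the modules can be found, not their versions.

```python
from importlib.util import find_spec


def missing_modules(names=("pysam", "art")):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]
```

An empty list from `missing_modules()` means both packages are installed.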

Options

Run with the -h option for more details:

usage: vcf2fasta.py [-h] --fasta GENOME --vcf VCF --gff GFF --feat FEAT
                    [--blend] [--inframe] [--out OUT] [--addref] [--skip]

        Converts regions/intervals in the genome into FASTA alignments
        provided a VCF file, a GFF file, and FASTA reference.

optional arguments:
  -h, --help            show this help message and exit
  --fasta GENOME, -f GENOME
                        FASTA file with the reference genome.
  --vcf VCF, -v VCF     a tabix-indexed VCF file.
  --gff GFF, -g GFF     GFF file.
  --feat FEAT, -e FEAT  feature/annotation in the GFF file. (i.e. gene, CDS, intron)
  --blend, -b           concatenate GFF entries of FEAT into a single alignment. Useful for CDS. (default: False)
  --inframe, -i         force the first codon of the sequence to be inframe. Useful for incomplete CDS. (default: False)
  --out OUT, -o OUT     provide a name for the output directory (optional)
  --addref, -r          include the reference sequence in the FASTA alignment (default: False)
  --skip, -s            skips features without variants (default: False)

        All files must be indexed. So before running the code make sure
        that your reference FASTA file is indexed:

        samtools faidx genome.fas

        BGZIP compress and TABIX index your VCF file:

        bgzip variants.vcf
        tabix variants.vcf.gz

        The GFF file does not need to be indexed.

        examples:
        python vcf2fasta.py -f genome.fas -v variants.vcf.gz -g intervals.gff -e CDS

If running on CDS, use the --blend | -b option to concatenate the coding sequences into a single alignment. Otherwise, a separate FASTA alignment is written for each CDS entry.
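
The idea behind concatenating CDS entries can be sketched as follows. This is not the program's implementation: the function name `blend_segments`, the grouping key, and the ordering by start coordinate are assumptions, and minus-strand features (which would need reverse complementing) are not handled.

```python
from collections import defaultdict


def blend_segments(segments):
    """Concatenate per-feature sequences into one sequence per group.

    `segments` is an iterable of (group, start, sequence) tuples, where
    `group` stands for whatever ties the CDS entries together (assumed
    here to be a gene or parent name). Segments are joined in order of
    their start coordinate.
    """
    grouped = defaultdict(list)
    for group, start, seq in segments:
        grouped[group].append((start, seq))
    return {
        group: "".join(seq for _, seq in sorted(parts))
        for group, parts in grouped.items()
    }
```

Without blending, each tuple would correspond to its own output alignment; with blending, the pieces of one group end up in a single sequence.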

Old instructions for the Perl version

Note: this script only works with SNPs, not indel variants. Use the Python version for consistency.

vcf2fasta.pl

Converts a VCF file to a FASTA alignment provided a reference genome and a GFF file

Given a FASTA reference genome, a multi-sample VCF file, and a GFF file, this script generates FASTA alignments for any feature found in the GFF file, for instance coding sequences (CDS). Gene names are taken from the first ID in field 9 of the GFF. The GFF file must be sorted by position. The script accepts diploid, phased, and/or haploid data. Note that it currently handles only simple VCF formats: files produced by haplotype-based callers (HaplotypeCaller and FreeBayes), which include gaps and multi-nucleotide variants, will not work.
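
Since the script expects a position-sorted GFF, a check along these lines can catch problems early. It is a sketch that assumes well-formed, tab-separated records and interprets "sorted by position" as non-decreasing start coordinates within each sequence; the function name `first_unsorted_record` is hypothetical.

```python
def first_unsorted_record(lines):
    """Return the 1-based line number of the first out-of-order record.

    A record is out of order if its start (column 4) is smaller than the
    start of an earlier record on the same sequence (column 1). Returns
    None if no such record is found.
    """
    last_start = {}
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        cols = line.split("\t")
        seqid, start = cols[0], int(cols[3])
        if start < last_start.get(seqid, 0):
            return number
        last_start[seqid] = start
    return None
```

Calling it on an open file, for example `first_unsorted_record(open("annotation.gff"))`, returns `None` when the file passes the check.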

Installation

git clone https://github.com/santiagosnchez/vcf2fasta
cd vcf2fasta
chmod +x vcf2fasta.pl
sudo cp vcf2fasta.pl /usr/local/bin

Running the code

Use the -h flag for more details:

perl vcf2fasta.pl -h
Usage:
perl vcf2fasta.pl -f <fasta-ref> -v <vcf-file> -g <gff-file> -e <gff-feature> [ --ref ] [ --phased ]

option --ref will include the reference sequence.
option --phased informs the program that the VCF is phased.
defaults ignore the two options.

examples:
perl vcf2fasta.pl -f ref.fas -v snps.vcf -g annotation.gff -e CDS
perl vcf2fasta.pl -f ref.fas -v snps.phased.vcf -g annotation.gff -e CDS --phased
perl vcf2fasta.pl -f ref.fas -v snps.phased.vcf -g annotation.gff -e CDS --phased --ref
