vcf2fasta

Note that this repo is no longer maintained.

Major release. Current version should now work with haploid, diploid, phased, and unphased (IUPAC) outputs.

vcf2fasta.py is a Python program that extracts FASTA alignments from VCF files, given a GFF file with feature coordinates and a reference FASTA file.
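As an illustration of what an unphased (IUPAC) output means, the sketch below collapses a diploid SNP genotype into a single character using the standard IUPAC two-base ambiguity codes. The function name `genotype_to_base` is hypothetical, and this is not taken from vcf2fasta.py itself; it only shows the encoding idea for biallelic single-nucleotide sites.

```python
# Standard IUPAC ambiguity codes for two-base combinations.
IUPAC_HET = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def genotype_to_base(allele1, allele2):
    """Collapse a diploid SNP genotype into one character.

    Homozygous genotypes return the base itself; heterozygous
    genotypes return the matching IUPAC ambiguity code.
    (Hypothetical helper, not part of vcf2fasta.py.)
    """
    if allele1 == allele2:
        return allele1
    return IUPAC_HET[frozenset((allele1, allele2))]


print(genotype_to_base("A", "G"))  # R
print(genotype_to_base("C", "C"))  # C
```

With phased data, the two alleles can instead be written out as two separate sequences per sample, so no ambiguity code is needed.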

Preprocessing

The reference must be indexed using:

samtools faidx ref.fa

The VCF file must be compressed with bgzip and then indexed with tabix:

bgzip my_vcf_file.vcf
tabix my_vcf_file.vcf.gz
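Before running the program, it can be worth confirming that both index files exist. The sketch below is one way to do that with the Python standard library. It assumes the default index names: `samtools faidx` writes `<fasta>.fai` next to the reference, and `tabix` writes `<vcf>.tbi` next to the compressed VCF. The function name `missing_indexes` is hypothetical.

```python
from pathlib import Path


def missing_indexes(fasta, vcf_gz):
    """Return the expected index files that are not found on disk.

    Assumes default index names; a tabix index written in another
    format (for example .csi) would not be detected here.
    """
    expected = [
        Path(str(fasta) + ".fai"),   # written by `samtools faidx`
        Path(str(vcf_gz) + ".tbi"),  # written by `tabix` by default
    ]
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    # File names follow the commands shown above.
    for path in missing_indexes("ref.fa", "my_vcf_file.vcf.gz"):
        print("missing index:", path)
```

An empty list means both indexes are in place.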

For most GFF3 files, no modification is needed if the structure follows the Ensembl convention. However, it is important to keep the whole structure of the GFF file, including complete gene features. If CDSs are the focus, they should be accompanied by their corresponding gene or parent feature:

gene
CDS/exon

Similarly, if the focus is on introns:

gene
intron

Or transcripts:

gene
transcript

And so on. Alternatively, all features can be left in the GFF file. In either case, the --feat | -e argument is always required.
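If a trimmed GFF is preferred, the parent/child layout above can be produced with a small filter. The following is a minimal sketch that keeps the chosen feature type together with its parent feature types, based on column 3 of the GFF. The function `filter_gff` and its `parents` argument are hypothetical; whether intermediate records such as mRNA or transcript must also be kept for a given GFF is an assumption to check, and they can be added to `parents` if needed.

```python
def filter_gff(lines, feature, parents=("gene",)):
    """Yield GFF lines whose type (column 3) is `feature` or one of `parents`.

    Comment and directive lines (starting with '#') are passed through.
    Lines with fewer than 9 tab-separated columns are dropped.
    """
    keep = set(parents) | {feature}
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9 and cols[2] in keep:
            yield line


if __name__ == "__main__":
    # Made-up records, for illustration only.
    gff = [
        "##gff-version 3\n",
        "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n",
        "chr1\tsrc\tintron\t40\t60\t.\t+\t.\tParent=g1\n",
        "chr1\tsrc\tCDS\t10\t39\t.\t+\t0\tParent=g1\n",
    ]
    for kept in filter_gff(gff, "CDS"):
        print(kept, end="")
```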

If the GFF contains multiple transcript isoforms, all of them will be fetched.

Requirements

pysam
art

pip3 install pysam art

Options

Run with the -h option for more details:

usage: vcf2fasta.py [-h] --fasta GENOME --vcf VCF --gff GFF --feat FEAT
                    [--blend] [--inframe] [--out OUT] [--addref] [--skip]

        Converts regions/intervals in the genome into FASTA alignments
        provided a VCF file, a GFF file, and FASTA reference.

optional arguments:
  -h, --help            show this help message and exit
  --fasta GENOME, -f GENOME
                        FASTA file with the reference genome.
  --vcf VCF, -v VCF     a tabix-indexed VCF file.
  --gff GFF, -g GFF     GFF file.
  --feat FEAT, -e FEAT  feature/annotation in the GFF file. (i.e. gene, CDS, intron)
  --blend, -b           concatenate GFF entries of FEAT into a single alignment. Useful for CDS. (default: False)
  --inframe, -i         force the first codon of the sequence to be inframe. Useful for incomplete CDS. (default: False)
  --out OUT, -o OUT     provide a name for the output directory (optional)
  --addref, -r          include the reference sequence in the FASTA alignment (default: False)
  --skip, -s            skips features without variants (default: False)

        All files must be indexed. So before running the code make sure
        that your reference FASTA file is indexed:

        samtools faidx genome.fas

        BGZIP compress and TABIX index your VCF file:

        bgzip variants.vcf
        tabix variants.vcf.gz

        The GFF file does not need to be indexed.

        examples:
        python vcf2fasta.py -f genome.fas -v variants.vcf.gz -g intervals.gff -e CDS

When running on CDS features, use the --blend | -b option to concatenate the coding sequences into a single alignment. Otherwise, a separate FASTA alignment is written for each CDS.
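Conceptually, blending means joining the per-CDS alignments end to end for every sample. The sketch below illustrates that idea only; it is not the program's implementation. The function name `blend` and the input layout are hypothetical, segments are simply ordered by start coordinate, and strand handling is left out.

```python
def blend(segments):
    """Concatenate per-feature alignments into one alignment per sample.

    `segments` is a list of (start, {sample: sequence}) pairs, one per
    CDS entry. Segments are joined in order of start coordinate.
    """
    merged = {}
    for _, alignment in sorted(segments, key=lambda seg: seg[0]):
        for sample, seq in alignment.items():
            merged[sample] = merged.get(sample, "") + seq
    return merged


if __name__ == "__main__":
    # Two made-up CDS segments for two samples, given out of order.
    segments = [
        (200, {"sample1": "GGG", "sample2": "GGA"}),
        (100, {"sample1": "ATG", "sample2": "ATG"}),
    ]
    for sample, seq in blend(segments).items():
        print(">" + sample)
        print(seq)
```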

Old instructions for the Perl version

Note: this script only works with SNPs, not indel variants. Use the Python version for consistency.

vcf2fasta.pl

Converts a VCF file to a FASTA alignment provided a reference genome and a GFF file

Given a FASTA reference genome, a multi-sample VCF file, and a GFF file, this script generates FASTA alignments of any feature found in the GFF file, for instance coding sequences (CDS). Gene names are taken from the first id in field 9 of the GFF. The GFF file must be sorted by position. The script accepts diploid, phased, and/or haploid data. Note that the script currently accepts only very standard VCF files; VCFs generated by haplotype-based callers (HaplotypeCaller and FreeBayes), which include gaps and multi-nucleotide variants, will not work.
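The two GFF requirements above (a usable id in field 9 and position-sorted records) can be checked before running the script. The sketch below is written in Python for convenience and is not taken from vcf2fasta.pl. It assumes that "first id" means the value of the first key=value pair in column 9, and that "sorted by position" means start coordinates never decrease within a sequence; both helper names are hypothetical.

```python
def first_id(attributes):
    """Return the value of the first key=value pair in GFF column 9."""
    first = attributes.split(";", 1)[0]
    return first.split("=", 1)[-1].strip()


def is_sorted_by_position(records):
    """Check that start coordinates never decrease within each sequence.

    `records` is an iterable of (seqid, start) pairs in file order.
    """
    last_start = {}
    for seqid, start in records:
        if start < last_start.get(seqid, 0):
            return False
        last_start[seqid] = start
    return True


if __name__ == "__main__":
    print(first_id("ID=gene1;Name=abc"))  # gene1
    print(is_sorted_by_position([("chr1", 10), ("chr1", 50), ("chr2", 5)]))  # True
    print(is_sorted_by_position([("chr1", 50), ("chr1", 10)]))  # False
```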

Installation

git clone https://github.com/santiagosnchez/vcf2fasta
cd vcf2fasta
chmod +x vcf2fasta.pl
sudo cp vcf2fasta.pl /usr/local/bin

Running the code

Use the -h flag for more details:

perl vcf2fasta.pl -h
Usage:
perl vcf2fasta.pl -f <fasta-ref> -v <vcf-file> -g <gff-file> -e <gff-feature> [ --ref ] [ --phased ]

option --ref will include the reference sequence.
option --phased informs the program that the VCF is phased.
defaults ignore the two options.

examples:
perl vcf2fasta.pl -f ref.fas -v snps.vcf -g annotation.gff -e CDS
perl vcf2fasta.pl -f ref.fas -v snps.phased.vcf -g annotation.gff -e CDS --phased
perl vcf2fasta.pl -f ref.fas -v snps.phased.vcf -g annotation.gff -e CDS --phased --ref
