Genion

An accurate tool to detect gene fusions from long transcriptomic reads.

Genion comes with a stand-alone binary and a helper Snakemake workflow that assists with mapping and preparing reference files.

Installation

You can install Genion through conda, with Docker, or from source.

Installation with bioconda

conda install -c bioconda genion

Consider using mamba, a faster reimplementation of conda.

Installation with Docker

git clone https://github.com/vpc-ccg/genion
cd genion/docker
docker build . -t genion:latest

After Genion is built with Docker, you can run it with the following command:

docker run --user=$UID -v /path/to/inputs:/input -v /path/to/outputdir:/output genion [args]

Installation from Source

Dependencies   Version
c++            gcc >= 9 or clang >= 8
zlib           >= 1.2.11

git clone https://github.com/vpc-ccg/genion
cd genion
make

Input Output Description

Input

Genion requires the following input files to run:

  • Mapping file of transcriptomic long reads (paf): Genion does not do mapping. It accepts mappings in paf format. You can use any splice-aware long-read-to-whole-genome mapper (and convert sam to paf using paftools if the mapper doesn't output paf).
  • Long reads file (fast{a,q}): These are used for low-complexity sequence filtering.
  • Gene annotation file (GTF)
  • Sequence similarity file: This is used to filter candidates from genes with similar sequences. It is a tab-separated 2-column file containing transcript pairs, produced by mapping the cDNA reference file against itself (all-to-all). It can be created using the Genion Snakemake workflow, or from the ENSEMBL cDNA reference with the following command lines:
minimap2 [cdna.fa] [cdna.fa] -X -t [threads] -2 -c -o [cdna.selfalign.paf]
cat [cdna.selfalign.paf] | cut -f1,6 | sed 's/_/\t/g' | awk 'BEGIN{OFS="\t";}{print substr($1,1,15),substr($2,1,15),substr($3,1,15),substr($4,1,15);}' | awk '$1!=$3' | sort | uniq > [cdna.selfalign.tsv]
  • Duplication annotation: Genomic segmental duplication annotation. This is used to filter out candidates that come from copies of the same segmental duplication. For hg38, it can be downloaded from ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/genomicSuperDups.txt.gz.
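
The cut/sed/awk pipeline given for the sequence similarity file can also be written as a short Python helper, which may be easier to adapt or debug. This is a minimal sketch and not part of Genion: the function names are hypothetical, and it assumes that cDNA sequence names have the form TRANSCRIPT_GENE (two underscore-separated Ensembl IDs), which is what the sed step appears to rely on.

```python
def paf_to_similarity_rows(paf_lines):
    """Roughly mirror the cut | sed | awk | sort | uniq pipeline on PAF lines.

    Assumption (not confirmed by the source): sequence names look like
    TRANSCRIPT_GENE, so splitting query and target names on '_' yields
    four ID fields per alignment.
    """
    rows = set()
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue  # too few columns to hold query and target names
        # PAF column 1 is the query name, column 6 is the target name.
        joined = "\t".join((cols[0], cols[5])).replace("_", "\t")
        fields = (joined.split("\t") + [""] * 4)[:4]
        # Keep the first 15 characters of each ID, like substr($n,1,15).
        row = tuple(field[:15] for field in fields)
        # Drop alignments of a transcript against itself, like awk '$1!=$3'.
        if row[0] != row[2]:
            rows.add(row)
    return sorted(rows)  # sorted and de-duplicated, like sort | uniq


def write_similarity_tsv(paf_path, tsv_path):
    """Read a self-alignment PAF file and write the tab-separated result."""
    with open(paf_path) as paf, open(tsv_path, "w") as out:
        for row in paf_to_similarity_rows(paf):
            out.write("\t".join(row) + "\n")
```

Truncating to 15 characters removes the version suffix from Ensembl IDs such as ENST00000000001.2, so different versions of the same transcript collapse into one entry.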

Running Genion

./genion \
    -i          /path/to/input/fastq \
    --gtf       /path/to/annotation/gtf \
    --gpaf      /path/to/genomic/mapping/paf \
    -s          /path/to/gene/homology/tsv \
    -d          /path/to/genomicSuperDups.txt \
    -o          /path/to/output/tsv
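
When Genion is launched from a script, it can help to assemble the argument list in one place. The sketch below only builds and formats the command using the flags shown above; the function names and the default binary location are illustrative assumptions, and nothing is executed.

```python
import shlex


def build_genion_command(fastq, gtf, paf, similarity_tsv, superdups,
                         output_tsv, binary="./genion"):
    """Return the Genion invocation as an argument list.

    Only flags that appear in the usage text are used; `binary` is a
    hypothetical default and should point at your genion executable.
    """
    return [
        binary,
        "-i", fastq,            # long reads (fasta/fastq)
        "--gtf", gtf,           # gene annotation
        "--gpaf", paf,          # genomic mapping in paf format
        "-s", similarity_tsv,   # sequence similarity (homology) file
        "-d", superdups,        # segmental duplication annotation
        "-o", output_tsv,       # output tsv
    ]


def format_command(args):
    """Quote each argument so the command can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in args)
```

The returned list can be passed directly to subprocess.run(args, check=True), which avoids shell quoting problems with paths that contain spaces.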

Output

  • [output]: Contains called gene fusions and readthroughs. It is a tab-separated file with the following columns:
    • gene1.id::gene2.id gene1.name::gene2.name ffigf-score FiN-score supporting-reads normal-counts pass-fail-code ranges
      • ffigf-score of fusion A::B: Number of supporting A::B fusion reads divided by the number of fusion reads mapping to gene A or gene B but not both.
      • FiN-score: Number of supporting A::B fusion reads divided by the sum of the numbers of normal A reads and normal B reads (normal reads being reads that do not support any gene fusion).
      • normal-counts: ';' separated list of non-chimeric read counts of the member genes.
      • pass-fail-code: PASS:GF for gene fusions, PASS:RT for readthroughs, FAIL::reason::... for filtered candidates.
      • ranges: semicolon-separated set of (median) genomic intervals showing which parts of the member genes are expressed (sorted in the same order as the gene ids).
  • [output].fail: Contains the filtered fusion candidates in the same column format.
  • [output].log: Contains genomic positions of the supporting reads formatted as:
    • read-id gene1-id gene1-range gene2-id gene2-range
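
The column layout and score definitions above can be sketched in Python as follows. This is an illustration under stated assumptions rather than Genion's own code: the names are hypothetical, the numeric columns are assumed to parse as float or int, and returning infinity for a zero denominator is a choice made here, since the source does not say how Genion handles that case.

```python
from typing import List, NamedTuple


class FusionCall(NamedTuple):
    gene_ids: List[str]
    gene_names: List[str]
    ffigf_score: float
    fin_score: float
    supporting_reads: int
    normal_counts: List[int]
    status: str
    ranges: List[str]


def parse_fusion_line(line):
    """Split one row of [output] into the eight documented columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 8:
        raise ValueError("expected 8 tab-separated columns, got %d" % len(cols))
    return FusionCall(
        gene_ids=cols[0].split("::"),
        gene_names=cols[1].split("::"),
        ffigf_score=float(cols[2]),      # assumed to be numeric
        fin_score=float(cols[3]),        # assumed to be numeric
        supporting_reads=int(cols[4]),   # assumed to be an integer count
        normal_counts=[int(c) for c in cols[5].split(";")],
        status=cols[6],
        ranges=cols[7].split(";"),
    )


def is_pass(call):
    """True for PASS:GF and PASS:RT rows, False for FAIL::... rows."""
    return call.status.startswith("PASS:")


def ffigf_score(supporting, fusion_reads_in_one_gene_only):
    """Supporting A::B reads / fusion reads mapping to A or B but not both."""
    if fusion_reads_in_one_gene_only == 0:
        return float("inf")  # assumption: Genion's behavior is not documented
    return supporting / fusion_reads_in_one_gene_only


def fin_score(supporting, normal_a, normal_b):
    """Supporting A::B reads / (normal A reads + normal B reads)."""
    total_normal = normal_a + normal_b
    if total_normal == 0:
        return float("inf")  # assumption: Genion's behavior is not documented
    return supporting / total_normal
```

For example, a fusion with 10 supporting reads, 30 normal reads of gene A, and 10 normal reads of gene B would have a FiN-score of 10 / (30 + 10) = 0.25 under these definitions.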

A Small Example

Download

You can download a small simulated example from: https://figshare.com/articles/dataset/Small_gene_fusion_simulated_long_read_dataset/17253821

Contents

example.fastq                  #Simulated reads fastq
example.paf                    #Mapping in paf format
genomicSuperDups.txt           #Genomic segmental duplication annotation
Homo_sapiens.GRCh38.97.gtf     #Gene annotation
cdna.self.tsv                  #Homology information

Preparation

tar xzvf small_example.tar.gz

Running

cd small_example
genion -i example.fastq -d genomicSuperDups.txt --gtf Homo_sapiens.GRCh38.97.gtf -g example.paf -s cdna.self.tsv -t 1 -o output.tsv 

Upon the successful run of genion on this example dataset, output.tsv should look like this.

Genion Snakemake

Additionally, we provide a Snakemake file to help run Genion. The workflow:

  • Maps long reads
  • Downloads the duplication annotation
  • Prepares the sequence similarity file
  • Runs Genion

Genion Snakemake dependencies

Dependencies   Version
c++            gcc >= 9 or clang >= 8
zlib           >= 1.2.11
Python         >= 3.7
snakemake      >= 5.3.0
deSALT         >= 1.5.5
minimap2       >= 2.17
paftools

Snakemake dependencies can be installed using conda/mamba:

conda create --file genion.env --name genion-env
conda activate genion-env

Snakemake Project Configuration

To run the Genion Snakemake workflow, you need to create a project configuration file named config.yaml. This configuration consists of a number of mandatory settings and some optional advanced settings. Below is the list of all the settings that you can set in your project.

config-parameter-name   Type        Description
path                    Mandatory   Full path to the project directory.
reference-dna           Mandatory   Full path to the DNA reference.
reference-cdna          Mandatory   Full path to the cDNA reference.
annotation-gtf          Mandatory   Full path to the GTF annotation.
rawdata-base            Mandatory   Location of the input fastq files relative to path.
or rawdata              Mandatory   Full path to the location of the input fastq files.
input                   Mandatory   A list of input files per sample. See the following example.
ext                     Optional    Extension of the fastq files used in input (fastq, fastq.gz, fq, fq.gz). Important: if this is wrong, Snakemake won't run correctly. default: fastq
genion-binary           Optional    Path to the genion binary. Should be set if genion is not in $PATH.
desalt-index            Optional    If not provided, the reference will be indexed during the run.
analysis-base           Optional    Location of intermediate files relative to path. default: {path}/analysis
or analysis             Optional    Full path to the location of intermediate files. default: {path}/analysis
results-base            Optional    Location of final results relative to path. default: {path}/results
or results              Optional    Full path to the location of final results. default: {path}/results
wg-aligner              Optional    Mapper to use (deSALT, minimap2). default: deSALT

Input formatting in the config file

Each input requires a fastq file and a type. The type is used to configure the mapper's parameters. The following input types are available:

type    Technology
ccs     PacBio SMRT CCS reads: error rate 1%
clr     PacBio SMRT CLR reads: error rate 15%
ont1d   Oxford Nanopore 1D reads: error rate > 20%
ont2d   Oxford Nanopore 2D reads: error rate > 12%

The following is an example config.yaml with one PacBio sample (A) and one Nanopore sample (B):

path:
    /path/to/project/directory

annotation-gtf:
    /path/to/annotation/gtf
reference-dna:
    /path/to/reference/dna
reference-cdna:
    /path/to/reference/cdna
desalt-index:
    /path/to/index/dir/
ext:
    fastq.gz
wg-aligner:
    deSALT
input:
    "A":
        type:
            clr
        fastq:
            - A_clr.fastq.gz
    "B":
        type:
            ont1d
        fastq:
            - B_ont.fastq.gz
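
Because a wrong ext or type value can make the workflow fail, it may be worth checking a configuration before launching Snakemake. The following is a minimal sketch, not part of Genion: check_config is a hypothetical helper that works on the configuration as a plain dictionary (for instance, as loaded with a YAML parser), and the rule that each listed fastq name must end with the configured extension is an assumption based on the example above.

```python
MANDATORY_KEYS = ("path", "reference-dna", "reference-cdna",
                  "annotation-gtf", "input")
EXTENSIONS = {"fastq", "fastq.gz", "fq", "fq.gz"}
ALIGNERS = {"deSALT", "minimap2"}
INPUT_TYPES = {"ccs", "clr", "ont1d", "ont2d"}


def check_config(config):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for key in MANDATORY_KEYS:
        if key not in config:
            problems.append("missing mandatory setting: " + key)
    # Either the relative or the absolute raw data location is required.
    if "rawdata-base" not in config and "rawdata" not in config:
        problems.append("missing mandatory setting: rawdata-base or rawdata")

    ext = config.get("ext", "fastq")  # documented default
    if ext not in EXTENSIONS:
        problems.append("unsupported ext: " + str(ext))
    aligner = config.get("wg-aligner", "deSALT")  # documented default
    if aligner not in ALIGNERS:
        problems.append("unsupported wg-aligner: " + str(aligner))

    for sample, entry in config.get("input", {}).items():
        if entry.get("type") not in INPUT_TYPES:
            problems.append("sample %s: unknown type %r"
                            % (sample, entry.get("type")))
        fastqs = entry.get("fastq", [])
        if not fastqs:
            problems.append("sample %s: no fastq files listed" % sample)
        for name in fastqs:
            # Assumption: listed file names carry the configured extension.
            if not name.endswith("." + ext):
                problems.append("sample %s: %s does not end with .%s"
                                % (sample, name, ext))
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the referenced files exist.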

Snakemake Input Output file structure

[path]/
├── rawdata
│   ├── A_clr.fastq.gz
│   └── B_ont.fastq.gz
├── analysis  (intermediate files)
│   ├── A
│   └── B
└── results
    ├── A.fusions.tsv
    ├── A.readthrough.tsv
    ├── B.fusions.tsv
    └── B.readthrough.tsv

The Snakemake configuration offers two options each for rawdata, analysis, and results. Use the -base suffix (for example, rawdata-base) to give a path relative to the project path, or use the plain name (for example, rawdata) to give an absolute path. The latter may be helpful if the input files are not in the project directory.
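
The relationship between the two forms can be sketched as a small lookup. This is an illustration of the rules described here, not the workflow's actual code: resolve_dir is a hypothetical name, and letting the absolute setting win when both forms are present is an assumption.

```python
import posixpath

# Documented defaults: {path}/analysis and {path}/results.
# No default is documented for rawdata, which is mandatory.
DEFAULT_SUBDIRS = {"analysis": "analysis", "results": "results"}


def resolve_dir(config, name):
    """Resolve 'rawdata', 'analysis' or 'results' to a directory path."""
    if name in config:
        # Plain key: taken as an absolute path (assumed to take precedence).
        return config[name]
    base_key = name + "-base"
    if base_key in config:
        # -base key: interpreted relative to the project path.
        return posixpath.join(config["path"], config[base_key])
    if name in DEFAULT_SUBDIRS:
        return posixpath.join(config["path"], DEFAULT_SUBDIRS[name])
    raise KeyError("neither %s nor %s is set" % (name, base_key))
```

With path set to /proj, rawdata-base set to rawdata, and results set to /scratch/out, this resolves rawdata to /proj/rawdata, results to /scratch/out, and analysis to the default /proj/analysis.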

Running Genion Snakemake

After preparing a config file following the Project Configuration section, you can run snakemake with the following command.

snakemake -j [number-of-threads] --configfile [path-to-config-file]