mapostolides edited this page Mar 28, 2024 · 18 revisions

MetaFusion clinical pipeline

Table of Contents

  1. Clinical implementation features
  2. Running MetaFusion clinical
  3. Singularity and Docker images
  4. Reference files
  5. Database File
  6. Viewing database tables
  7. Final Output File column descriptions
  8. Minimum Working Example
  9. Updating Database Tables

Clinical implementation features

The clinical implementation of MetaFusion adds features and modifies MetaFusion to be suitable for clinical use:

  • Clustering by breakpoints alone instead of by both gene name and breakpoint
  • Selection of a single harmonized breakpoint from among all callers
  • An SQLite database to track historical calls, clinically relevant calls, and false positives
  • Annotation of calls with previous patient occurrence
  • PFAM functional domains (both retained and removed)
  • Junction sequence (based on harmonized breakpoints)
  • Manual curation of clinically relevant calls and false positives, allowing for continued improvement of the tool

Running MetaFusion clinical

See the Minimum Working Example (MWE) section for a full tutorial

Note: MetaFusion assumes 0-based indexing for breakpoints in the input CFF. If using fusion callers other than the 8 we use in our manuscript, ensure that 1-based callers are converted to 0-based indexing before running MetaFusion.
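The conversion mentioned in the note above can be sketched with awk. This is only an illustration: the helper name cff_to_zero_based is hypothetical, and it assumes a tab-delimited CFF without a header line in which the two breakpoint positions sit in columns 2 and 5. Check the layout of your own CFF before using it, and apply it only to rows produced by 1-based callers.

```shell
# Hypothetical helper: shift both breakpoints from 1-based to 0-based.
# Assumption: tab-delimited CFF, breakpoint positions in columns 2 and 5.
cff_to_zero_based() {
    awk 'BEGIN { FS = OFS = "\t" }
         { $2 = $2 - 1; $5 = $5 - 1; print }' "$1"
}
```

For example, cff_to_zero_based caller.cff > caller.0based.cff writes a converted copy and leaves the original file untouched.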

bash MetaFusion.clinical.sh --outdir $outdir \
                 --cff $cff  \
                 --gene_bed $gene_bed \
                 --annotate_exons \
                 --fusion_annotator \
                 --genome_fasta $genome_fasta \
                 --gene_info $gene_info \
                 --num_tools=2  \
                 --per_sample \
                 --recurrent_bedpe $recurrent_bedpe \
                 --scripts $fusiontools \
                 --database $database \
                 --update_hist \
                 --ref_dir $ref_dir

Below is an overview of MetaFusion's different flags:

--outdir : Output directory for a given dataset. Should have the format "$runs_dir/sample_name.$date"

--cff: CFF file for the dataset

--gene_bed: The MetaFusion-specific annotation file

--annotate_exons: Annotates exons

--fusion_annotator (Optional) : This flag can be used to turn FusionAnnotator on. This should only be done if the required "~/Metafusion/reference_files/ctat_genome_lib_build_dir" directory exists, and contains the following two files: "blast_pairs.idx" and "fusion_annot_lib.idx".

--genome_fasta (Optional): Human Genome reference .fasta

--gene_info: NCBI gene info file

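--num_tools: Caller filter threshold N, the minimum number of tools that must call a fusion for it to be kept. Value can be set to any integer from 1 to num_callers, the number of callers used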

--per_sample: Merged calls with identical breakpoints are listed on separate lines when they occur in different samples, instead of creating a single entry with a comma-separated sample list

--recurrent_bedpe: File containing recurrent breakpoints from Arriba's blacklist file. Used in the blocklist step

--scripts: Path to scripts used by MetaFusion

--database: Absolute path to the SQLite database historical_fusions.db

--update_hist (Optional): When this flag is used, the historical_fusions table is updated with the calls from the current run. Omitting this flag allows a run to be annotated with previous calls without adding its own calls to the historical_fusions table

--ref_dir: Directory containing MetaFusion's annotation files

Flags labelled "Optional" can be removed and MetaFusion will still work. The locations of files are specified in the test data run script "RUN_MetaFusion.Docker.sh".
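Two of the flags above carry conventions that are easy to get wrong: the naming of --outdir and the file requirements of --fusion_annotator. The following is a minimal sketch of how they could be checked before a run. Both helper names are hypothetical, and the YYYY-MM-DD date format is an assumption, since no specific date format is prescribed.

```shell
# Hypothetical helpers; names and defaults are illustrative, not part of MetaFusion.

# Build an output directory name of the form "$runs_dir/sample_name.$date".
# Assumption: the date is written as YYYY-MM-DD (defaults to today).
make_outdir_name() {
    local runs_dir=$1 sample_name=$2
    local run_date=${3:-$(date +%Y-%m-%d)}
    printf '%s/%s.%s\n' "$runs_dir" "$sample_name" "$run_date"
}

# Succeed only if both FusionAnnotator index files exist in the given directory.
fusion_annotator_ready() {
    local lib_dir=$1
    [ -f "$lib_dir/blast_pairs.idx" ] && [ -f "$lib_dir/fusion_annot_lib.idx" ]
}
```

A run script could then set outdir=$(make_outdir_name "$runs_dir" patient1) and pass --fusion_annotator only when fusion_annotator_ready "$HOME/Metafusion/reference_files/ctat_genome_lib_build_dir" succeeds.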

Singularity and Docker images

Singularity image can be downloaded from the following link:
https://figshare.com/articles/software/MetaFusion_clinical_simg/14454342

Docker image can be downloaded as follows:

docker pull mapostolides/metafusion:readxl_writexl

Reference files

Download reference_files.tar.gz from: https://figshare.com/articles/dataset/MetaFusion-Clinical_reference_files/24239689

Then extract the archive and place the resulting directory inside the MetaFusion-Clinical code directory at /MetaFusion-Clinical/reference_files

Note that files "new_bed.total.Oct-1-2020.uniq.bed.gz" and "hgTables.gene_symbol.ENSG.ENST.ENSP.Nov-13-2020.tsv.gz" need to be unzipped individually as well to allow the files to be read
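The extraction steps can be sketched as follows. The helper name unpack_reference_files is hypothetical, and the sketch assumes that the archive unpacks to a directory called reference_files. Because the location of the two gzipped files inside that directory is not stated here, they are located by name with find.

```shell
# Hypothetical helper: unpack the archive, then decompress the two gzipped files.
# Assumption: the archive unpacks to a reference_files/ directory.
unpack_reference_files() {
    local archive=$1 code_dir=$2
    tar -xzf "$archive" -C "$code_dir" || return 1
    # Locate the two files by name and decompress them in place.
    find "$code_dir/reference_files" \
        \( -name 'new_bed.total.Oct-1-2020.uniq.bed.gz' \
        -o -name 'hgTables.gene_symbol.ENSG.ENST.ENSP.Nov-13-2020.tsv.gz' \) \
        -exec gunzip {} \;
}
```

For example, unpack_reference_files reference_files.tar.gz /MetaFusion-Clinical would leave the reference files under /MetaFusion-Clinical/reference_files.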

Database File

The database allows MetaFusion clinical to be customizable for a clinic's own purposes.

historical_fusions.db is an SQLite database containing 3 tables: historical_fusions, clinical_fusions, and false_positives.

  • Historical Fusion Table: Updated automatically when MetaFusion is run. A record of all calls made by callers, before filtration. Used to annotate new calls with calls seen before. Can be used to inform curation of the clinical_fusions, and false_positives tables, allowing for identification of clinically relevant fusions and false positives based on recurrence.

  • Clinical Fusions Table: A manually updated table containing clinically relevant fusions. Serves as a record of all previously seen clinically relevant calls and is used to annotate new calls with calls seen before. A fusion call that matches this table bypasses MetaFusion's filters and ends up in the final output file.

  • False Positives Table: A manually curated false positive list. Calls in current run that match based on gene name are removed from MetaFusion's final output. Breakpoints are not considered.
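Since the database is a regular SQLite file, the three tables can also be inspected directly from the command line. The sketch below assumes that the sqlite3 command-line tool is installed. The helper name summarize_fusion_db is hypothetical, and only the table names given above are used because the column layout is not described here.

```shell
# Hypothetical helper: print the row count of each of the three tables.
# Assumption: the sqlite3 command-line tool is available on PATH.
summarize_fusion_db() {
    local database=$1 table
    for table in historical_fusions clinical_fusions false_positives; do
        printf '%s\t%s\n' "$table" \
            "$(sqlite3 "$database" "SELECT COUNT(*) FROM $table;")"
    done
}
```

Running sqlite3 "$database" ".schema" additionally prints the column definitions of each table.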

Viewing database tables

To view the contents of the database tables, the following script can be used:

VIEW_database_tables.sh

Ensure the "view" option is selected, and choose the database table you wish to view. An Excel spreadsheet will be generated showing the contents of the chosen table.

For the historical_fusions table, "view" is the only permitted operation, since it is updated automatically when MetaFusion.clinical is run. Two files are generated for this table:

  • historical_fusions.view.2021-04-22.10:08:09.xlsx
  • samples_run_so_far.2021-04-22.10:08:09.xlsx

For the false_positives table, the following file will be produced:

  • false_positives.view.2021-04-22.09:52:56.xlsx

For the clinical_fusions table, the following file will be produced:

  • clinical_fusions.view.2021-04-22.09:50:34.xlsx
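The file names above embed a timestamp of the form YYYY-MM-DD.HH:MM:SS, so sorting them alphabetically also sorts them chronologically. A small sketch for picking the most recent view of a table, with the hypothetical helper name latest_view_file:

```shell
# Hypothetical helper: print the most recent view file for a table.
latest_view_file() {
    local table=$1 dir=${2:-.}
    # Lexical order equals chronological order for these timestamps.
    ls "$dir"/"$table".view.*.xlsx 2>/dev/null | sort | tail -n 1
}
```

For example, latest_view_file clinical_fusions prints the newest clinical_fusions view in the current directory, or nothing if none exists.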

Final Output File column descriptions

The final MetaFusion.clinical output file is final.n2.cluster.xlsx

Column descriptions are as follows:

gene1 Head gene

gene2 Tail gene

num_tools The number of tools which call this event

max_split_cnt Junction-crossing reads (maximum value among all callers)

max_span_cnt Junction-flanking reads (maximum value among all callers)

frame The combined frame information from callers which provide it, comma-delimited

cancer_db_hits Database hits extracted from FusionAnnotator output

samples The sample ID

chr1 Head gene chromosome

breakpoint_1 Head gene breakpoint

chr2 Tail gene chromosome

breakpoint_2 Tail gene breakpoint

inferred_fusion_type Category assigned by MetaFusion

disease Disease assigned to this sample by user in the CFF file

tools Comma-separated list of fusion caller names

rna_type1 The type of RNA of the head gene (e.g. mRNA, lincRNA, etc.)

rna_type2 The type of RNA of the tail gene

strand1 The strand of the head gene

strand2 The strand of the tail gene

clinical_samples Previous samples in which this fusion is found, from the clinical_fusions database table

num_clinical_samples The number of previous samples in which this fusion is found, from the clinical_fusions database table

prev_samples Previous samples in which this fusion is found, from the historical_fusions database table

num_prev_samples The number of previous samples in which this fusion is found, from the historical_fusions database table

junction_sequence The junction sequence surrounding the breakpoint, extracted from reference using MetaFusion's breakpoints

domains_kept_gene1 PFAM protein domains upstream of the breakpoint of gene1, which are kept in the chimeric RNA

domains_removed_gene1 PFAM protein domains downstream of the breakpoint of gene1, which are not present in the chimeric RNA

domains_kept_gene2 PFAM protein domains downstream of the breakpoint of gene2, which are kept in the chimeric RNA

domains_removed_gene2 PFAM protein domains upstream of the breakpoint of gene2, which are not present in the chimeric RNA

gene1_on_bnd Specifies whether the breakpoint of gene1 is exactly on an exon-intron boundary

gene1_close_to_bnd Specifies whether the breakpoint of gene1 is close to an exon-intron boundary

gene2_on_bnd Specifies whether the breakpoint of gene2 is exactly on an exon-intron boundary

gene2_close_to_bnd Specifies whether the breakpoint of gene2 is close to an exon-intron boundary

exon1 The IGV coordinates of the exon immediately upstream of breakpoint_1, or of the exon that contains the breakpoint

exon2 The IGV coordinates of the exon immediately downstream of breakpoint_2, or of the exon that contains the breakpoint

sample_type The type of sample (Tumor/Normal)

fusion_IDs Comma-separated list of unique fusion identifiers which correspond to separate calls in the annotated CFF file
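When working with the plain-text version of the output (the MWE section mentions final.n2.cluster.filt), it can be convenient to pull out a few of the columns described above by name. The following sketch assumes that the file is tab-delimited and that its first line holds the column names. Both points are assumptions, and select_columns is a hypothetical helper.

```shell
# Hypothetical helper: print selected columns, chosen by header name.
# Assumption: tab-delimited file whose first line holds the column names.
select_columns() {
    local file=$1 wanted=$2   # wanted: comma-separated column names
    awk -v wanted="$wanted" '
        BEGIN { FS = OFS = "\t"; n = split(wanted, names, ",") }
        NR == 1 {
            # Map each header name to its column index.
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++)
                if (!(names[j] in pos)) {
                    print "missing column: " names[j] > "/dev/stderr"
                    exit 1
                }
        }
        {
            line = $(pos[names[1]])
            for (j = 2; j <= n; j++) line = line OFS $(pos[names[j]])
            print line
        }' "$file"
}
```

For example, select_columns final.n2.cluster.filt gene1,gene2,num_tools,num_prev_samples prints those four columns, including the header row.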

Minimum Working Example

Note: This tutorial requires scripting experience. Absolute paths and file names must be properly set where needed

Before starting this tutorial, ensure you have downloaded the reference files and pulled the docker image

The minimum working example (MWE) consists of two runs, each with its own set of patients:

https://github.com/ccmbioinfo/MetaFusion-Clinical/blob/master/MWE

The empty database, historical_database.EMPTY.db, can be copied and renamed historical_database.db to use in this tutorial.
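A minimal sketch of this copy step, which refuses to overwrite a database that already exists so that a previously loaded database is not lost by accident. The helper name init_database is hypothetical, and the paths are whatever locations you choose for the two files.

```shell
# Hypothetical helper: create a working database from the empty template.
init_database() {
    local empty_db=$1 target_db=$2
    if [ -e "$target_db" ]; then
        echo "refusing to overwrite $target_db" >&2
        return 1
    fi
    cp "$empty_db" "$target_db"
}
```

For example: init_database historical_database.EMPTY.db historical_database.db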

First, start the Docker environment. Make sure that 8G of RAM (memory) is allocated in the Docker "Resources" tab; otherwise the program will crash.

docker run -it --entrypoint /bin/bash -v /Users/maposto/MetaFusion-Clinical:/Users/maposto/MetaFusion-Clinical mapostolides/metafusion:readxl_writexl

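The -v option mounts the host directory /Users/maposto/MetaFusion-Clinical at the same path inside the running container. Replace /Users/maposto/MetaFusion-Clinical with the absolute path to your own copy of the code.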

Then, navigate to the scripts directory. All scripts will be run in this directory

cd /Users/maposto/MetaFusion-Clinical/scripts

Run 1

Starting with an empty database (assign database file name and path appropriately), run MetaFusion-Clinical on run1.cff which contains 3 patients: Two with a BCR--ABL1 fusion and one with a KIAA1549--BRAF fusion.

To do this, run the script RUN1.MWE.sh

Note that the absolute paths for database=, fusiontools=, ref_dir=, outdir=, cff= must be set
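Relative paths are an easy mistake to make here. As a sketch, a small check like the following could be added to a run script after the variables are assigned. The helper name require_absolute_paths is hypothetical and not part of MetaFusion.

```shell
# Hypothetical helper: fail if any named variable is unset or not an absolute path.
require_absolute_paths() {
    local name value status=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        case $value in
            /*) ;;
            *) echo "$name must be an absolute path (got '$value')" >&2
               status=1 ;;
        esac
    done
    return $status
}
```

For example: require_absolute_paths database fusiontools ref_dir outdir cff || exit 1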

The output file in Excel format, final.n2.cluster.filt.xlsx, or in plain-text format, final.n2.cluster.filt, will be produced inside the directory run1

Update clinical fusions

Next, add BCR--ABL1 and KIAA1549--BRAF fusions to the clinical_fusions database table. This is important because there is a KIAA1549--BRAF fusion called by only one caller in Run 2 which will not pass the filters otherwise

To do this, run the script UPDATE_clinical_fusions.MWE.sh

Note that absolute paths for database=, excel= and scripts= directories must be set

Run 2

Now that the clinical_fusions database table is loaded with the results of run1, run on run2.cff, which contains 4 patients: One with a BCR--ABL1 fusion and three with a KIAA1549--BRAF fusion. Note that in patient7, only one caller calls the fusion KIAA1549--BRAF. It's been added to the clinical_fusions database table, which allows this call to be put into the final output file.

This second run is annotated with information from the clinical_fusions and historical_fusions database tables. This can be confirmed by looking at the clinical_samples, num_clinical_samples, prev_samples and num_prev_samples columns of the output file final.n2.cluster.filt.xlsx.

Note that the absolute paths for database=, fusiontools=, ref_dir=, outdir=, cff= must be set

Output files

Pre-run outputs of both runs are provided in the directories run1.out and run2.out, so that you can confirm that your runs worked correctly. Your own runs are in directories run1 and run2.
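One way to make this comparison is to diff the plain-text outputs. The sketch below sorts both files first, on the assumption that row order may differ between runs. The helper name compare_with_reference is hypothetical. Differences can still be legitimate, for example if your database contained other samples before the run.

```shell
# Hypothetical helper: compare a run's plain-text output with the pre-run copy.
# Assumption: row order may differ between runs, so both files are sorted first.
compare_with_reference() {
    local mine=$1 reference=$2
    local a b status=0
    a=$(mktemp)
    b=$(mktemp)
    sort "$mine" > "$a"
    sort "$reference" > "$b"
    # diff prints the differing lines and returns non-zero if any exist.
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return $status
}
```

For example: compare_with_reference run1/final.n2.cluster.filt run1.out/final.n2.cluster.filt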

The database file historical_database.LOADED.db is what your database should look like after completing the tutorial.

Other test data

For more test data to play around with, sample CFF files can be found here:

https://github.com/ccmbioinfo/MetaFusion-Clinical/tree/master/test_data/cff

Updating Database Tables

In order to improve MetaFusion’s usefulness, the clinical_fusions and false_positives database tables should be updated as new samples are run. The script UPDATE_clinical_fusions.MWE.sh can be used for this

This script takes 4 arguments, which are specified inside the file:

# Absolute path to database
database=/absolute/path/to/historical_fusions.db

# Absolute path to Excel spreadsheet with selected calls
excel=/absolute/path/to/final.n2.cluster.CLIN.xlsx

# Operation "add"
operation=add

# Table to update "clinical_fusions" or "false_positives"
table=clinical_fusions

Of these, the two paths must be set for your environment:

  • database: the absolute path to the database historical_fusions.db
  • excel: the absolute path to the Excel spreadsheet (here final.n2.cluster.CLIN.xlsx) containing the subset of MetaFusion’s output that you have decided are clinically relevant fusions. Ideally, make a copy of the “final.n2.cluster.xlsx” spreadsheet produced by MetaFusion and delete the rows which are not being added, leaving only the clinically relevant rows which are to be added to the table.

Run this script from a directory of your choice. Two Excel spreadsheets named with the date and time are generated there, containing the contents of the clinical_fusions or false_positives table before and after the update, as follows:

  • clinical_fusions.2021-03-22.15:55:01.xlsx
  • clinical_fusions.updated.2021-03-22.15:55:01.xlsx

One can then open the above spreadsheets to compare the original and updated tables to confirm that the desired fusions have been added correctly.

Notes:

  • The Excel spreadsheet must be a subset of MetaFusion output. This is the only compatible format.
  • Rows must be removed by deleting the entire row, not just the text within that row. There should be no blank rows between calls.