Home
- Clinical implementation features
- Running MetaFusion clinical
- Singularity and Docker images
- Reference files
- Database File
- Viewing database tables
- Final Output File column descriptions
- Minimum Working Example
- Updating Database Tables
The clinical implementation of MetaFusion adds features and modifies MetaFusion to be suitable for clinical use:
- Clustering by breakpoints alone instead of by both gene name and breakpoint
- Selection of a single harmonized breakpoint from among all callers
- An SQLite database to track historical calls, clinically relevant calls, and false positives
- Annotation of calls with previous patient occurrence
- PFAM functional domains (both retained and removed)
- Junction sequence (based on harmonized breakpoints)
- Manual curation of clinically relevant calls and false positives, to allow for continued improvement of the tool
See the MWE section for a full tutorial.
Note: MetaFusion assumes 0-based indexing for breakpoints in the input CFF. If using fusion callers other than the 8 we use in our manuscript, ensure that 1-based callers are converted to 0-based indexing before running MetaFusion.
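As a rough sketch of the conversion described in the note above, the helper below subtracts 1 from two breakpoint columns of a tab-separated CFF. The function name `to_zero_based` and the default column positions (2 and 5) are assumptions for illustration, not part of MetaFusion. Check them against your own CFF, and apply the conversion only to rows from callers that report 1-based coordinates.

```shell
# Hypothetical helper (not part of MetaFusion): shift two breakpoint
# columns of a tab-separated file from 1-based to 0-based coordinates.
# Usage: to_zero_based FILE [POS1_COLUMN] [POS2_COLUMN]
# The default columns 2 and 5 are assumptions; the file is assumed to
# have no header line.
to_zero_based() {
  awk -v c1="${2:-2}" -v c2="${3:-5}" '
    BEGIN { FS = OFS = "\t" }
    { $c1 = $c1 - 1; $c2 = $c2 - 1; print }
  ' "$1"
}
```

For example, `to_zero_based caller_1based.cff > caller_0based.cff` writes a converted copy and leaves the input untouched (both file names are placeholders).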
bash MetaFusion.clinical.sh --outdir $outdir \
--cff $cff \
--gene_bed $gene_bed \
--annotate_exons \
--fusion_annotator \
--genome_fasta $genome_fasta \
--gene_info $gene_info \
--num_tools=2 \
--per_sample \
--recurrent_bedpe $recurrent_bedpe \
--scripts $fusiontools \
--database $database \
--update_hist \
--ref_dir $ref_dir
Below is an overview of MetaFusion's different flags:
--outdir : Output directory for a given dataset. Should have the format "$runs_dir/sample_name.$date"
--cff: CFF file for the dataset
--gene_bed: The MetaFusion-specific annotation file
--annotate_exons: Annotates exons
--fusion_annotator (Optional) : This flag can be used to turn FusionAnnotator on. This should only be done if the required "~/Metafusion/reference_files/ctat_genome_lib_build_dir" directory exists, and contains the following two files: "blast_pairs.idx" and "fusion_annot_lib.idx".
--genome_fasta (Optional): Human Genome reference .fasta
--gene_info: NCBI gene info file
--num_tools: Caller filter threshold N, the minimum number of tools that must call a fusion for it to pass the filter. Can be set to any integer from 1 to num_callers
--per_sample: Merged calls with identical breakpoints are listed on separate lines when they occur in different samples, instead of creating a single entry with a comma-separated sample list
--recurrent_bedpe: File containing recurrent breakpoints from arriba's blacklist file. Used in blocklist step
--scripts: Path to scripts used by MetaFusion
--database: Absolute path to the SQLite database historical_fusions.db
--update_hist (Optional): When this flag is used, updates historical_fusions table with current run. Removing this flag allows a run to be annotated with previous calls without adding calls in current run to the historical_fusions table
--ref_dir: Directory containing MetaFusion's annotation files
Flags labelled "Optional" can be removed and MetaFusion will still work. The locations of files are specified in the test data run script "RUN_MetaFusion.Docker.sh".
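Since --fusion_annotator should only be passed when the two index files exist, a small pre-flight check can decide whether to include it. This is a minimal sketch; `fusion_annotator_flag` is a hypothetical helper, not one of MetaFusion's scripts.

```shell
# Hypothetical pre-flight check (not part of MetaFusion): print
# "--fusion_annotator" only when both index files named above exist
# in the given ctat_genome_lib_build_dir directory.
# Usage: fusion_annotator_flag CTAT_DIR
fusion_annotator_flag() {
  if [ -f "$1/blast_pairs.idx" ] && [ -f "$1/fusion_annot_lib.idx" ]; then
    printf '%s\n' "--fusion_annotator"
  fi
}
```

The output can be substituted into the run command in place of the literal flag, so the flag is silently dropped when the directory is incomplete.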
Singularity image can be downloaded from the following link:
https://figshare.com/articles/software/MetaFusion_clinical_simg/14454342
Docker image can be downloaded as follows:
docker pull mapostolides/metafusion:readxl_writexl
Download reference_files.tar.gz from: https://figshare.com/articles/dataset/MetaFusion-Clinical_reference_files/24239689
Then extract it and place the contents inside the MetaFusion-Clinical code directory at /MetaFusion-Clinical/reference_files
Note that the files "new_bed.total.Oct-1-2020.uniq.bed.gz" and "hgTables.gene_symbol.ENSG.ENST.ENSP.Nov-13-2020.tsv.gz" must also be unzipped individually so that they can be read.
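The second decompression step can be sketched as follows. `unzip_reference_files` is a hypothetical helper, and it assumes the two .gz files sit directly inside the directory you pass in; adjust the path if your extracted archive is laid out differently.

```shell
# Hypothetical helper (not part of MetaFusion): decompress the two
# reference files named above if only their .gz versions are present.
# Usage: unzip_reference_files REFERENCE_DIR
unzip_reference_files() {
  for name in new_bed.total.Oct-1-2020.uniq.bed \
              hgTables.gene_symbol.ENSG.ENST.ENSP.Nov-13-2020.tsv; do
    if [ ! -f "$1/$name" ] && [ -f "$1/$name.gz" ]; then
      gunzip "$1/$name.gz"   # replaces NAME.gz with NAME
    fi
  done
}
```

Running it a second time is harmless, because files that are already decompressed are skipped.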
The database allows MetaFusion clinical to be customizable for a clinic's own purposes.
The database historical_fusions.db is a database containing 3 tables: historical_fusions, clinical_fusions, and false_positives.
- Historical Fusion Table: Updated automatically when MetaFusion is run. A record of all calls made by callers, before filtration. Used to annotate new calls with calls seen before. Can be used to inform curation of the clinical_fusions and false_positives tables, allowing clinically relevant fusions and false positives to be identified based on recurrence.
- Clinical Fusions Table: A manually updated table containing clinically relevant fusions. Serves as a record of all previously seen clinically relevant calls, is used to annotate new calls with calls seen before, and allows matching fusion calls to bypass MetaFusion's filters and end up in the final output file.
- False Positives Table: A manually curated false positive list. Calls in the current run that match based on gene name are removed from MetaFusion's final output. Breakpoints are not considered.
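For a quick look at how many rows each table holds without generating a spreadsheet, the database can also be queried directly. This is a minimal sketch that assumes the `sqlite3` command-line shell is installed; `count_rows` is a hypothetical helper. It relies only on the three table names given above, since the column layout is not described here.

```shell
# Hypothetical helper (not part of MetaFusion): count the rows in one of
# the three tables named above, using the sqlite3 command-line shell.
# Usage: count_rows DATABASE TABLE
count_rows() {
  case "$2" in
    historical_fusions|clinical_fusions|false_positives) ;;
    *) echo "unknown table: $2" >&2; return 1 ;;
  esac
  sqlite3 "$1" "SELECT COUNT(*) FROM $2;"
}
```

The table name is checked against the known list before it is placed in the SQL statement, so a typo fails early instead of producing a confusing SQLite error.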
To view the contents of the database tables, the following script can be used:
VIEW_database_tables.sh
Ensure the "view" option is selected, and choose the database table you wish to view. An Excel spreadsheet will be generated showing the contents of the chosen table.
For the historical_fusions table, "view" is the only permitted operation, since it is updated automatically when MetaFusion.clinical is run. Two files are generated for this table:
historical_fusions.view.2021-04-22.10:08:09.xlsx
samples_run_so_far.2021-04-22.10:08:09.xlsx
For the false_positives table, the following file will be produced:
false_positives.view.2021-04-22.09:52:56.xlsx
For the clinical_fusions table, the following file will be produced:
clinical_fusions.view.2021-04-22.09:50:34.xlsx
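Because each generated file name embeds a timestamp, repeated views of the same table produce differently named files. A minimal sketch for picking the most recent one, assuming the files follow the naming pattern shown above and sit in a single directory; `latest_view_file` is a hypothetical helper, not a MetaFusion script.

```shell
# Hypothetical helper (not part of MetaFusion): print the newest
# "<table>.view.<date>.<time>.xlsx" file in a directory. It relies on the
# shell expanding the glob in sorted order, which is chronological here
# because the timestamp is written as YYYY-MM-DD.HH:MM:SS.
# Usage: latest_view_file TABLE [DIRECTORY]
latest_view_file() {
  latest=""
  for f in "${2:-.}/$1".view.*.xlsx; do
    if [ -e "$f" ]; then latest=$f; fi
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

The function returns a non-zero status and prints nothing when no matching file exists.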
The final MetaFusion.clinical output file is final.n2.cluster.xlsx
Column descriptions are as follows:
gene1 Head gene.
gene2 Tail gene
num_tools The number of tools which call this event
max_split_cnt junction crossing reads (maximum value among all callers)
max_span_cnt junction flanking reads (maximum value among all callers)
frame The combined frame information from callers which provide it, comma-delimited
cancer_db_hits Database hits extracted from FusionAnnotator output
samples The sample ID
chr1 Head gene chromosome
breakpoint_1 Head gene breakpoint
chr2 Tail gene chromosome
breakpoint_2 Tail gene breakpoint
inferred_fusion_type Category assigned by MetaFusion
disease Disease assigned to this sample by user in the CFF file
tools Comma-separated list of fusion caller names
rna_type1 The type of RNA of the head gene (e.g. mRNA, lincRNA, etc.)
rna_type2 The type of RNA of the tail gene
strand1 The strand of the head gene
strand2 The strand of the tail gene
clinical_samples Previous samples in which this fusion is found, from the clinical_fusions database table
num_clinical_samples The number of previous samples in which this fusion is found, from the clinical_fusions database table
prev_samples Previous samples in which this fusion is found, from the historical_fusions database table
num_prev_samples The number of previous samples in which this fusion is found, from the historical_fusions database table
junction_sequence The junction sequence surrounding the breakpoint, extracted from reference using MetaFusion's breakpoints
domains_kept_gene1 PFAM protein domains upstream of the breakpoint of gene1, which are kept in the chimeric RNA
domains_removed_gene1 PFAM protein domains downstream of the breakpoint of gene1, which are not present in the chimeric RNA
domains_kept_gene2 PFAM protein domains downstream of the breakpoint of gene2, which are kept in the chimeric RNA
domains_removed_gene2 PFAM protein domains upstream of the breakpoint of gene2, which are not present in the chimeric RNA
gene1_on_bnd Specifies whether the breakpoint of gene1 is exactly on an exon-intron boundary
gene1_close_to_bnd Specifies whether the breakpoint of gene1 is close to an exon-intron boundary
gene2_on_bnd Specifies whether the breakpoint of gene2 is exactly on an exon-intron boundary
gene2_close_to_bnd Specifies whether the breakpoint of gene2 is close to an exon-intron boundary
exon1 The IGV coordinates of the exon immediately upstream of breakpoint_1, or which the breakpoint is within
exon2 The IGV coordinates of the exon immediately downstream of breakpoint_2, or which the breakpoint is within
sample_type The sample type (Tumor/Normal)
fusion_IDs Comma-separated list of unique fusion identifiers which correspond to separate calls in the annotated CFF file
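When working with the plain-text version of the output rather than the spreadsheet, it can help to pull out a few of the columns listed above by name. This is a minimal sketch that assumes the plain-text file is tab-separated and starts with a header line carrying these column names; `select_columns` is a hypothetical helper.

```shell
# Hypothetical helper (not part of MetaFusion): print selected columns of
# a tab-separated file by header name.
# Usage: select_columns FILE "name1,name2,..."
# Assumes the first line is a header; a leading "#" on it is ignored.
# Exits with status 1 if a requested column is not in the header.
select_columns() {
  awk -v want="$2" '
    BEGIN { FS = OFS = "\t"; n = split(want, names, ",") }
    NR == 1 {
      sub(/^#/, "", $1)
      for (i = 1; i <= NF; i++) col[$i] = i
      for (j = 1; j <= n; j++) if (!(names[j] in col)) exit 1
    }
    {
      for (j = 1; j <= n; j++)
        printf "%s%s", $(col[names[j]]), (j < n ? OFS : ORS)
    }
  ' "$1"
}
```

For example, `select_columns final.n2.cluster.filt gene1,gene2,num_prev_samples` would list each fusion with its count of previous samples, provided the header matches.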
Note: This tutorial requires scripting experience. Absolute paths and file names must be set correctly where needed.
Before starting this tutorial, ensure you have downloaded the reference files and pulled the Docker image.
The minimum working example (MWE) consists of two runs, each with a set of patients:
https://github.com/ccmbioinfo/MetaFusion-Clinical/blob/master/MWE
The empty database, historical_database.EMPTY.db, can be copied and renamed historical_database.db to use in this tutorial.
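The copy step can be sketched as below. `init_database` is a hypothetical helper; it refuses to overwrite an existing target so that a database already loaded with historical calls is not replaced by the empty one by accident.

```shell
# Hypothetical helper (not part of MetaFusion): create a working database
# from the empty template without overwriting an existing one.
# Usage: init_database EMPTY_DB TARGET_DB
init_database() {
  if [ -e "$2" ]; then
    echo "refusing to overwrite existing database: $2" >&2
    return 1
  fi
  cp "$1" "$2"
}
```

For the tutorial this would be called as `init_database historical_database.EMPTY.db historical_database.db` from the directory holding the template.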
First, start the Docker environment. Make sure that 8 GB of RAM (Memory) is allocated in Docker's "Resources" tab; otherwise the program will crash.
docker run -it --entrypoint /bin/bash -v /Users/maposto/MetaFusion-Clinical:/Users/maposto/MetaFusion-Clinical mapostolides/metafusion:readxl_writexl
This creates the path /Users/maposto/MetaFusion-Clinical inside the running container.
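The command above mounts the host checkout at the same path inside the container, so absolute paths set in the scripts are valid in both places. To adapt it to a checkout in a different location, the command can be assembled from a single path, as in this sketch; `print_docker_cmd` is a hypothetical helper that only prints the command and does not run Docker.

```shell
# Hypothetical helper (not part of MetaFusion): print the docker run
# command for a checkout at REPO_DIR, mounting it at the identical path
# inside the container.
# Usage: print_docker_cmd REPO_DIR
print_docker_cmd() {
  printf 'docker run -it --entrypoint /bin/bash -v %s:%s %s\n' \
    "$1" "$1" "mapostolides/metafusion:readxl_writexl"
}
```

Printing the command first makes it easy to review the mount path before starting the container.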
Then, navigate to the scripts directory. All scripts will be run in this directory
cd /Users/maposto/MetaFusion-Clinical/scripts
Starting with an empty database (assign the database file name and path appropriately), run MetaFusion-Clinical on run1.cff, which contains 3 patients: two with a BCR--ABL1 fusion and one with a KIAA1549--BRAF fusion.
To do this, run the script RUN1.MWE.sh
Note that the absolute paths for database=, fusiontools=, ref_dir=, outdir= and cff= must be set.
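A forgotten or relative path is an easy mistake to make here, so a small guard near the top of a run script can catch it before MetaFusion starts. This is a minimal sketch; `check_abs_paths` is a hypothetical helper and not part of RUN1.MWE.sh. It takes variable names, not values.

```shell
# Hypothetical helper (not part of MetaFusion): verify that each named
# shell variable is set to an absolute path (one starting with "/").
# Usage: check_abs_paths NAME...
check_abs_paths() {
  rc=0
  for name in "$@"; do
    eval "value=\${$name:-}"   # look up the variable called $name
    case "$value" in
      /*) ;;
      *) echo "$name must be an absolute path (got '$value')" >&2; rc=1 ;;
    esac
  done
  return "$rc"
}
```

Used as `check_abs_paths database fusiontools ref_dir outdir cff || exit 1`, it reports every offending variable in one pass instead of stopping at the first.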
The output file in Excel format, final.n2.cluster.filt.xlsx, or in plain-text format, final.n2.cluster.filt, will be produced inside the directory run1.
Next, add BCR--ABL1 and KIAA1549--BRAF fusions to the clinical_fusions database table. This is important because there is a KIAA1549--BRAF fusion called by only one caller in Run 2 which will not pass the filters otherwise
To do this, run the script UPDATE_clinical_fusions.MWE.sh
Note that the absolute paths for database=, excel= and scripts= must be set.
Now that the clinical_fusions database table is loaded with the results of run1, run on run2.cff, which contains 4 patients: one with a BCR--ABL1 fusion and three with a KIAA1549--BRAF fusion. Note that in patient7, only one caller calls the KIAA1549--BRAF fusion. Because this fusion has been added to the clinical_fusions database table, the call is still placed in the final output file.
This second run is annotated with information from the clinical_fusions and historical_fusions database tables, which can be confirmed by looking at the clinical_samples, num_clinical_samples, prev_samples and num_prev_samples columns of the output file final.n2.cluster.filt.xlsx.
Note that the absolute paths for database=, fusiontools=, ref_dir=, outdir= and cff= must be set.
Pre-computed output directories for the two runs are provided as run1.out and run2.out, so that you can confirm your runs worked correctly. Your own runs are in the directories run1 and run2.
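One way to make that comparison is a byte-for-byte check of the plain-text output, sketched below. `compare_run` is a hypothetical helper, and it assumes the output file sits directly inside each run directory. A reported difference is only a prompt to inspect the files; this sketch does not establish that the outputs are expected to be byte-identical across machines.

```shell
# Hypothetical helper (not part of MetaFusion): compare one output file
# between your run directory and the provided reference directory.
# Usage: compare_run YOUR_DIR REFERENCE_DIR [FILE]
compare_run() {
  file=${3:-final.n2.cluster.filt}
  if cmp -s "$1/$file" "$2/$file"; then
    echo "identical: $file"
  else
    echo "differs or missing: $file"
    return 1
  fi
}
```

For the tutorial, `compare_run run1 run1.out` and `compare_run run2 run2.out` cover both runs.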
The database file historical_database.LOADED.db is what your database should look like after completing the tutorial.
For more test data to play around with, sample CFF files can be found here:
https://github.com/ccmbioinfo/MetaFusion-Clinical/tree/master/test_data/cff
In order to improve MetaFusion’s usefulness, the clinical_fusions and false_positives database tables should be updated as new samples are run. The script UPDATE_clinical_fusions.MWE.sh can be used for this.
This script takes 4 arguments, which are specified inside the file:
# Absolute path to database
database=/absolute/path/to/historical_fusions.db
# Absolute path to Excel spreadsheet with selected calls
excel=/absolute/path/to/final.n2.cluster.CLIN.xlsx
# Operation "add"
operation=add
# Table to update "clinical_fusions" or "false_positives"
table=clinical_fusions
The above script has the following variables that need to be set:
- historical_fusions.db: the database, with an absolute path
- final.n2.cluster.CLIN.xlsx: The Excel spreadsheet containing a subset of MetaFusion’s output, which you have decided are clinically relevant fusions. Ideally, one should make a copy of the “final.n2.cluster.xlsx” spreadsheet provided by MetaFusion and delete the rows which are not being added, leaving only clinically relevant rows which are to be added to the table.
Run this script from a directory of your choice. Excel spreadsheets containing the contents of the clinical_fusions or false_positives table before and after the update are generated in that directory, with the date and time in their names, as follows:
clinical_fusions.2021-03-22.15:55:01.xlsx
clinical_fusions.updated.2021-03-22.15:55:01.xlsx
One can then open the above spreadsheets to compare the original and updated tables to confirm that the desired fusions have been added correctly.
Notes:
- The Excel spreadsheet must be a subset of MetaFusion output. This is the only compatible format.
- Rows that are not being added must be removed by deleting the entire row, not just the text within it. There should be no blank rows between calls.