Analyses and figures generated for the manuscript "Accurate fusion transcript identification from long and short read isoform sequencing at bulk or single cell resolution" by "Qian Qin et al."
This repo focuses on analyses and figure generation for the manuscript. To compute the benchmarking results analyzed here, please see the separate GitHub repo: https://github.com/fusiontranscripts/LR-FusionBenchmarking
The JAFFAL (Badread) simulated fusion reads were obtained from: https://ndownloader.figshare.com/files/27676470
Fusion prediction results available here
Analysis: 1.Benchmark_Simulated_Fusions/1a.jaffalpaper_simulated_reads/analyze_jaffal_simdata_accuracy.Rmd (Figure 2b,c)
PacBio and ONT R10.4.1 fusion reads were simulated using PBSIM3. Reads are available at https://zenodo.org/records/10650516
Fusion prediction results are available for PacBio and ONT
Analysis: 1.Benchmark_Simulated_Fusions/1b.pbsim3_simulated_reads/simulated_reads_summary.Rmd (Figure 2d)
Simulated paralog fusions fastq file.
Evaluation of paralog fusion detection: 1.Benchmark_Simulated_Fusions/1c.sim_paralog_fusions/Examine_sim_paralog_fusion_detection.Rmd
SeraCare Fusion Mix v4 was sequenced using PacBio MAS-ISO-seq/Kinnex and by Illumina TruSeq (see links for reads in fastq format).
Fusion predictions for combined ctat-LR-Fusion w/ FusionInspector: https://github.com/broadinstitute/CTAT-LRF-Paper/tree/main/2.SeraCareFusions/2a.CTAT_SeraCareFusion/data/ctatLRF_FI
Analysis with no downsampling of long reads: 2.SeraCareFusions/2a.CTAT_SeraCareFusion/CTAT_SeraCareFusion.Rmd (Supplementary Figure S2)
Analysis with long reads downsampled to match the number of bases sequenced by Illumina: 2.SeraCareFusions/2a.CTAT_SeraCareFusion/2a.1.SubsampledSeraCareLR/Downsampled_LR_match_Illumina.Rmd (Figure 3a)
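As an illustration of what downsampling long reads to a target number of bases can look like, here is a minimal Python sketch. It is not the procedure used in the Rmd above: the function name `downsample_to_base_count`, the read names, the read lengths, and the target base count are all hypothetical, and the strategy (shuffle, then keep reads until the summed length reaches the target) is an assumption.

```python
import random


def downsample_to_base_count(read_lengths, target_bases, seed=0):
    """Randomly pick reads until their summed length reaches target_bases.

    read_lengths: dict mapping read name -> read length in bases.
    Returns the list of selected read names (hypothetical helper).
    """
    rng = random.Random(seed)
    names = sorted(read_lengths)  # fixed starting order so the seed is reproducible
    rng.shuffle(names)

    selected, total = [], 0
    for name in names:
        if total >= target_bases:
            break
        selected.append(name)
        total += read_lengths[name]
    return selected


# Toy input: made-up long-read lengths and a made-up Illumina base total.
long_reads = {"lr1": 1200, "lr2": 800, "lr3": 2500, "lr4": 1500, "lr5": 600}
illumina_bases = 3000

kept = downsample_to_base_count(long_reads, illumina_bases)
kept_bases = sum(long_reads[name] for name in kept)
print(kept, kept_bases)
```

Because whole reads are kept, the selected total slightly overshoots the target rather than matching it exactly.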
Fusion prediction results for all methods: https://github.com/fusiontranscripts/LR-FusionBenchmarking/tree/master/SeraCareFusions/prog_results
Analysis: 2.SeraCareFusions/2b.SeraCareFusionBenchmarking/SeraCareFusionAnalysis.Rmd (Figure 3b)
Note: you should install this customized R library for the feature-based UpSetR plots: https://github.com/fusiontranscripts/UpSetRbyFeature (see the top of that README for installation instructions)
Fusion predictions for all the methods: https://github.com/fusiontranscripts/LR-FusionBenchmarking/tree/master/DepMap_Cell_Lines/prog_results
Example benchmarking requiring min 3 reads and min 2 methods agreeing: 3.DepMap9Lines/3b.DepMap9Lines_Benchmarking/__bmark_min-3-reads/DepMap9Lines_Benchmarking.min-3-read.Rmd (Figures 4a-e)
'Wisdom of the crowds' benchmarking summary across 30 different truth sets: 3.DepMap9Lines/3b.DepMap9Lines_Benchmarking/Examine_PR_AUC_varied_minReads.Rmd (Figure 4f)
Illumina-supported fusions truth set benchmarking: 3.DepMap9Lines/3b.DepMap9Lines_Benchmarking/3b.2.Illumina_TP_unique_FP_bmarking/Illum_TP_uniq_FP_summary.Rmd (Figure 4g)
Comparison of fusion breakpoint read counts between predictions shared by STAR-Fusion and Arriba and those uniquely reported by each method: 3.DepMap9Lines/3b.DepMap9Lines_Benchmarking/3b.1.IlluminaTruSeqDepMap9Lines/__reevaluate_arriba_starF_overlap_via_coords/Examine_nonoverlapping_STARF_arriba.Rmd
Summary evaluation of DepMap long read fusion prediction accuracy using Illumina-based fusion truth sets: 3b.DepMap9Lines_Benchmarking/3b.2.Illumina_TP_unique_FP_bmarking/Illum_TP_uniq_FP_summary.Rmd (Figure 4h)
Example of benchmarking using the Illumina-based Arriba-intersect-StarFusion truth set: 3.DepMap9Lines/3b.DepMap9Lines_Benchmarking/3b.2.Illumina_TP_unique_FP_bmarking/__illum_TP_uniq_FP.arriba,starF/DepMap9Lines_Benchmarking.illum_TP_uniq_FP.arriba,starF.Rmd (Supplementary Figure S3)
Evaluating use of JAFFAL high-confidence predictions only, instead of all predictions, with the Illumina-supported truth set: 3.DepMap9Lines/3b.DepMap9Lines_Benchmarking/3b.3.Illumina_TP_jaffal_highconfonly/Compare_JAFFAL_HighConf_ROC.Rmd. This analysis shows better overall P-R AUC values when using the entire prediction set, so we continued to use the full JAFFAL predictions in all benchmarking experiments, ranked by fusion read support, consistent with all other methods evaluated.
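The "min 3 reads and min 2 methods agreeing" criterion and the P-R AUC ranking by read support mentioned above might be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the benchmarking code itself (that lives in the LR-FusionBenchmarking repo): the method, sample, and fusion names are hypothetical, and both the truth-set rule (a fusion counts as true when at least `min_methods` methods each report it with at least `min_reads` reads) and the trapezoidal AUC are one plausible reading.

```python
from collections import defaultdict


def consensus_truth_set(predictions, min_reads=3, min_methods=2):
    """predictions: iterable of (method, sample, fusion, num_reads) tuples."""
    supporters = defaultdict(set)
    for method, sample, fusion, num_reads in predictions:
        if num_reads >= min_reads:
            supporters[(sample, fusion)].add(method)
    return {key for key, methods in supporters.items() if len(methods) >= min_methods}


def precision_recall_points(ranked_calls, truth):
    """ranked_calls: (sample, fusion) keys sorted by decreasing read support."""
    points, tp, fp = [], 0, 0
    for call in ranked_calls:
        if call in truth:
            tp += 1
        else:
            fp += 1
        points.append((tp / len(truth), tp / (tp + fp)))  # (recall, precision)
    return points


def pr_auc(points):
    """Trapezoidal area under the (recall, precision) points, starting at recall 0."""
    if not points:
        return 0.0
    area, prev_recall, prev_precision = 0.0, 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


# Hypothetical predictions from three made-up methods.
predictions = [
    ("methodA", "sample1", "GENEA--GENEB", 10),
    ("methodB", "sample1", "GENEA--GENEB", 4),
    ("methodA", "sample1", "GENEC--GENED", 5),
    ("methodB", "sample1", "GENEC--GENED", 1),  # below the read minimum
    ("methodC", "sample2", "GENEE--GENEF", 7),
    ("methodA", "sample2", "GENEE--GENEF", 3),
]
truth = consensus_truth_set(predictions)

# Rank methodA's calls by read support, highest first, then score them.
method_a = sorted((p for p in predictions if p[0] == "methodA"), key=lambda p: -p[3])
points = precision_recall_points([(s, f) for _, s, f, _ in method_a], truth)
auc = pr_auc(points)
print(sorted(truth), round(auc, 3))
```

Varying `min_reads` and `min_methods` yields different truth sets, which is the idea behind summarizing accuracy across many truth-set definitions.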
Analysis: 3.DepMap9Lines/3a.CTAT_DepMap9Lines/CTAT_DepMap9Lines.Rmd (Figure 5)
Analysis: 3.DepMap9Lines/3a.CTAT_DepMap9Lines/3a.2.ThreePrimeBiasAnalysis/examine_3prime_breakpoint_readlengths.Rmd (Supplementary Figure S4)
ONT transcriptome sequences were obtained from the SG-NEx project.
Analysis of trusted fusions vs. others detected using ONT dRNA seqs: 5.Misc/5.3.SGNex_ONT_eval/SGNex_ONT_eval.Rmd (Figure 6)
Benchmarking ONT fusion detection using the trusted sequences and flagging uniquely predicted fusions as false positives:
- default mode: 5.Misc/5.3.SGNex_ONT_eval/5.3.3.SGNex_Illumina_benchmarking/default_mode/SGNEx_DefaultModes.Rmd
- first filtering fusions with breakpoints proximal to exon boundaries: 5.Misc/5.3.SGNex_ONT_eval/5.3.3.SGNex_Illumina_benchmarking/fuzzy_brkpt_restricted/SGNEx_FuzzyRestricted.Rmd
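The scoring scheme described above, where trusted fusions are true positives and uniquely predicted fusions are false positives, could be sketched as below. This is a hedged Python illustration rather than the code behind these Rmd files: the function `score_method`, the method names, and the fusion labels are hypothetical, and leaving untrusted fusions that are shared by several methods out of the scoring is an assumption.

```python
def score_method(method, calls_by_method, trusted):
    """Score one method's calls against a trusted fusion set.

    calls_by_method: dict mapping method name -> set of predicted fusions.
    trusted: set of fusions treated as true positives.
    """
    own = calls_by_method[method]
    others = set().union(
        *(calls for name, calls in calls_by_method.items() if name != method)
    )

    tp = own & trusted
    fp = (own - trusted) - others       # untrusted and reported by no other method
    ignored = (own - trusted) & others  # untrusted but shared: left unscored (assumption)
    fn = trusted - own

    called = len(tp) + len(fp)
    return {
        "TP": len(tp),
        "FP": len(fp),
        "FN": len(fn),
        "ignored": len(ignored),
        "precision": len(tp) / called if called else 0.0,
        "recall": len(tp) / len(trusted) if trusted else 0.0,
    }


# Hypothetical calls from three made-up methods; F1 and F2 are the trusted fusions.
calls_by_method = {
    "methodA": {"F1", "F2", "F3"},
    "methodB": {"F1", "F3", "F4"},
    "methodC": {"F1"},
}
trusted = {"F1", "F2"}

scores = {name: score_method(name, calls_by_method, trusted) for name in calls_by_method}
for name, result in scores.items():
    print(name, result)
```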
The melanoma patient sample RNA-seq data are protected and available through dbGaP under accession phs003200.v1.p1
Analysis of fusions using long and short read alignments: 4.SingleCellFusions/4a.sc_Melanoma/M132TS_analysis.Rmd (Figure 7a,b)
Evaluation of NUTM2A fusion cell content by using 'grep' with fusion breakpoint sequences: 4.SingleCellFusions/4a.sc_Melanoma/4a.1.grep_search_brkpt/GrepMatchedFusionCells.Rmd
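A grep-style search of reads for a fusion breakpoint sequence can be approximated in Python as shown here. This is a sketch only: the breakpoint sequence, the reads, and the cell barcodes are made up (they are not the NUTM2A breakpoint sequences used in the Rmd above), and checking the reverse complement as well as the forward sequence is an assumption about how one would want to match reads of either orientation.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def cells_with_breakpoint(reads, breakpoint_seq):
    """Return the barcodes of cells with a read containing the breakpoint sequence.

    reads: iterable of (cell_barcode, read_sequence) pairs.
    Exact substring matching in either orientation, as plain grep would do.
    """
    targets = (breakpoint_seq, reverse_complement(breakpoint_seq))
    return {
        barcode
        for barcode, seq in reads
        if any(target in seq for target in targets)
    }


# Made-up breakpoint-spanning sequence and reads, for illustration only.
breakpoint_seq = "GATTACAGG"
reads = [
    ("cell1", "TTTGATTACAGGAAA"),  # forward match
    ("cell2", "CCCCTGTAATCGG"),    # reverse-complement match
    ("cell3", "ACGTACGTACGT"),     # no match
    ("cell1", "ACGTACGTAAAA"),     # second read from cell1, no match
]

fusion_cells = cells_with_breakpoint(reads, breakpoint_seq)
print(sorted(fusion_cells))
```

Exact matching misses reads with sequencing errors inside the breakpoint window, so counts from this kind of search are best read as a lower bound.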
These data are available at EGA under accessions EGAD00001009814 (PacBio) and EGAD00001009815 (Illumina)
Analysis of HGSOC Patient-1: 4.SingleCellFusions/4b.sc_HGSOC/Patient1_analysis.Rmd (Figure 8a-c)
Analysis of HGSOC Patient-2: 4.SingleCellFusions/4b.sc_HGSOC/Patient2_analysis.Rmd
Analysis of HGSOC Patient-3: 4.SingleCellFusions/4b.sc_HGSOC/Patient3_analysis.Rmd (Figure 8d,e)
Shows that ctat-minimap2 in chimeric-only mode is 4x faster than in regular mode.
Analysis: 5.Misc/5.1.ctat-mm2-timings/ctat-mm2-timings.Rmd
Analysis: 5.Misc/5.2.fusion_workflow_resource_usages/ExamineResourceUsage.Rmd (Supplementary Figure S6)