Version 1.1.3

hillerlab · Jun 30, 2023 · bd6ea2b · bd6ea2b
1 parent 9882fde
commit bd6ea2b
Show file tree

Hide file tree

Showing 9 changed files with 159 additions and 73 deletions.
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 
 ![TOGA logo](https://github.com/hillerlab/TOGA/blob/master/supply/logo.png)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
-![version](https://img.shields.io/badge/version-1.1.0-blue)
+![version](https://img.shields.io/badge/version-1.1.3-blue)
 [![DOI](https://zenodo.org/badge/277817661.svg)](https://zenodo.org/badge/latestdoi/277817661)
 
 TOGA is a new method that integrates gene annotation, inferring orthologs and classifying
@@ -14,12 +14,14 @@ related species and to accurately distinguish orthologs from paralogs or process
 This tutorial explains how to get started using TOGA.
 It shows how to install and execute TOGA, and how to handle possible issues that may occur.
 
-[TOGA changelog](https://github.com/hillerlab/TOGA/blob/master/VersionHistory.md)
+For more details, please check out the [TOGA wiki](https://github.com/hillerlab/TOGA/wiki).
+
+[Changelog](https://github.com/hillerlab/TOGA/blob/master/VersionHistory.md).
 
 ## Installation
 
-TOGA supports both Linux and MacOS systems.
-The package was properly tested on Python version 3.6.5. and 3.7.3.
+TOGA is compatible with Linux and MacOS, including M1-based systems.
+It has been tested and verified to work with Python versions 3.9 and higher.
 
 It is highly recommended to have access to computational cluster, but
 for small or partial genomes with short genes a desktop PC will be enough.

diff --git a/VersionHistory.md b/VersionHistory.md
@@ -4,7 +4,7 @@ This was used to produce all data in the original manuscript:
 Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos DG, Hilgers L, Lindblad-Toh K, Karlsson EK, Zoonomia Consortium, Hiller M.
 [Integrating gene annotation with orthology inference at scale](https://www.biorxiv.org/content/10.1101/2022.09.08.507143v1). _Science_, in press, 2022
 
-# TOGA 1.1 #
+# TOGA 1.1.0 #
 
 
 ### Improved transcript classification ###
@@ -14,6 +14,9 @@ In TOGA 1.1, we now consider the region between compensating frameshifts as dele
 * Improved classification of N-terminal mutations. In TOGA 1.0, we masked inactivating mutations if they are located in the first 10% of the reading frame. The reason was that often there is a downstream ATG in proximity, but this is not always the case. 
 In TOGA 1.1, we now determine where the next inframe ATG codon is located that is downstream of a potential inactivating mutation. If this ATG is located in the first 10% of the reading frame, the mutation is not considered as inactivating (as in 1.0). However, if this ATG is located outside the first 10% of the reading frame, a truncated protein would lack at least 10% of the N-terminus and TOGA 1.1 now considers this mutation as inactivating in the transcript classification step. 
 
+* Improved classification of borderline cases where a projection has considerable amount of both deleted and missing regions.
+
+* Fixed intron retention module bug. Now, TOGA should correctly identify them.
 
 ### Codon alignments ###
 * The codonAlignment.fasta file previously contained the raw CESAR output, including exons that were classified as deleted or missing in later processing steps. TOGA 1.1 now generates a codonAlignment.fasta where such exons are excluded. 
@@ -26,5 +29,32 @@ In TOGA 1.1, we now determine where the next inframe ATG codon is located that i
 ### Minor changes ###
 * Longer lists of several compensating frameshifts (e.g. FS_1,2,3,4,5) could result in warnings when loading SQL tables if the string gets too long. To avoid this, we now write a condensed format FS_1-5. 
 * Avoiding overlapping exon annotations in fragmented query assemblies. If a gene is split into several query chains, there were rare cases where several query chains overlap in the query genome, resulting in exon annotations that overlap each other. This creates invalid bed or genePred records. In TOGA 1.1, we now classify such genes as missing and do not output a bed entry for this gene.
-* Sometimes a gene had several orthologous chains and one of them is a 'runaway' chain that spans megabases in the query locus, which is not computable in the CESAR step. TOGA 1.1 now ignores the runaway chain but runs CESAR for the other (normal) orthologous chains. This adds a few orthologs to the annotation. 
+* Sometimes a gene had several orthologous chains and one of them is a 'runaway' chain that spans megabases in the query locus, which is not computable in the CESAR step. TOGA 1.1 now ignores the runaway chain but runs CESAR for the other (normal) orthologous chains. This adds a few orthologs to the annotation.
+* Added TOGA_assemblyStats.py
+* Bug fixes in merge_cesar_output.py and CESAR output parser.
+* Minor changes in plot_mutations.py.
+
+
+### Nomenculature updates ###
+* `MIDDLE_80%_INTACT` projection's feature renamed to `MIDDLE_IS_PRESENT`
+* `SSM` mutation type (splice site mutation) is split into more precise `SSMD` (donor) and `SSMA` (acceptor)
+
+
+# TOGA 1.1.1 #
+
+### Critical bugfixes ###
+
+* Fixed codon alignment generation procedure - all codons are 3bp long. Previous version could mistakenly include dinucleoties into the codon alignment. 
+* Fixed parsing the protein and codon alignment output. Previous version could add a sequence to protein alignment if its ID contained "PROT".
+
+# TOGA 1.1.2 #
+
+Documentation improvements.
+
+# TOGA 1.1.3 #
+
+* Added support for Nextflow DSL2.
+* Enabled generation of the processed pseudogenes annotation track.
+* Removed `-f/--fragmented_genome` flag from toga.py CLI - the "assemble transcript from fragments" feature is enabled by default. Please use `--disable_fragments_joining/--dfj` option to disable this feature.
+* Fixed package versions in the `requirements.txt`
 
diff --git a/configure.sh b/configure.sh
@@ -2,16 +2,28 @@
 exit_status=0
 mydir="${0%/*}"
 
+OVERRIDE=false
+if [ "$1" == "--override" ]; then
+    OVERRIDE=true
+    echo "Overriding existing CESAR installation and models"
+fi
+
 printf "Compiling C code...\n"
-CFLAGS="-Wall -Wextra -O2 -g -std=c99" 
 
-gcc $CFLAGS -o ${mydir}/modules/chain_score_filter ${mydir}/modules/chain_score_filter.c 
-gcc $CFLAGS -o ${mydir}/modules/chain_filter_by_id ${mydir}/modules/chain_filter_by_id.c 
+# Check machine architecture and set appropriate flags
+if [ $(uname -m) = "arm64" ]; then
+    CFLAGS="-Wall -Wextra -O2 -g -std=c99" # adjust flags for M1 if necessary
+else
+    CFLAGS="-Wall -Wextra -O2 -g -std=c99" # original flags for x86
+fi
+
+gcc $CFLAGS -o ${mydir}/modules/chain_score_filter ${mydir}/modules/chain_score_filter.c
+gcc $CFLAGS -o ${mydir}/modules/chain_filter_by_id ${mydir}/modules/chain_filter_by_id.c
 gcc $CFLAGS -fPIC -shared -o ${mydir}/modules/chain_coords_converter_slib.so ${mydir}/modules/chain_coords_converter_slib.c
 gcc $CFLAGS -fPIC -shared -o ${mydir}/modules/extract_subchain_slib.so ${mydir}/modules/extract_subchain_slib.c
 gcc $CFLAGS -fPIC -shared -o ${mydir}/modules/chain_bst_lib.so ${mydir}/modules/chain_bst_lib.c
 
-if [[ -f "./models/se_model.dat" ]] || [[ -f "./models/me_model.dat" ]]
+if ! $OVERRIDE && { [[ -f "./models/se_model.dat" ]] || [[ -f "./models/me_model.dat" ]]; }
 then
     printf "Model found\n";
 else
@@ -20,7 +32,7 @@ else
     printf "Model created\n"
 fi
 
-if [[ -f "./CESAR2.0/cesar" ]]
+if ! $OVERRIDE && [[ -f "./CESAR2.0/cesar" ]]
 then
     printf "CESAR installation found\n"
 else

diff --git a/execute_joblist.nf b/execute_joblist.nf
@@ -1,5 +1,8 @@
 #!/usr/bin/env nextflow
-// Author: Bogdan Kirilenko, 2020
+
+nextflow.enable.dsl=2
+
+// Author: Bogdan Kirilenko, 2023
 // Nextflow procedure to execute chain feature extraction jobs
 // Joblist contains a file where each line is a separate command
 // We just call these lines in parallel
@@ -9,22 +12,29 @@ params.joblist = 'NONE'  // file containing jobs
 
 // if still default -> nothing assigned: show usage message and quit
 if (params.joblist == "NONE"){
-    printf("Usage: nextflow execute_joblist.nf  --joblist [joblist file] -c [config file]\n")
+    println("Usage: nextflow execute_joblist.nf  --joblist [joblist file] -c [config file]")
     System.exit(2);
-} 
+}
 
 // create channel lines -> we need to execute lines in parallel
-joblist = file(params.joblist)
-lines = Channel.from(joblist.readLines())
+lines = Channel.fromPath(params.joblist).splitText()
 
 process execute_jobs {
-    
+
     // allow each process to fail 3 times
     errorStrategy 'retry'
     maxRetries 3
 
     input:
-    val line from lines
+    val line
+
+    // one line represents an independent command
+    script:
+    """
+    ${line}
+    """
+}
 
-    "${line}"  // one line represents an independent command
+workflow {
+    execute_jobs(lines)
 }
diff --git a/modules/gene_losses_summary.py b/modules/gene_losses_summary.py
@@ -309,11 +309,11 @@ def get_projection_classes(
                 print(f"Threshold is {REM_T_L} codons; decide between L and M")
                 # print(f"-> class L")
             if frame_oub > 0.65:
-                print(f"Out of chain prob > 0.65: {frame_oub}")
-                print(f"-> class M")
+                # print(f"Out of chain prob > 0.65: {frame_oub}")
+                # print(f"-> class M")
                 projection_class[projection] = M
             else:
-                print("-> class L")
+                # print("-> class L")
                 projection_class[projection] = L
             continue
 

diff --git a/modules/make_pr_pseudogenes_anno.py b/modules/make_pr_pseudogenes_anno.py
@@ -127,9 +127,33 @@ def get_corr_q_regions(gene_to_pp_chains, chain_file, chain_dict, bed_bdb, v=Fal
     return proj_to_q_reg
 
 
+def merge_intervals(intervals):
+    # Sort the intervals by start position
+    intervals.sort(key=lambda x: (x[0], x[1], x[2]))
+
+    # Initialize the merged list with the first interval
+    merged = [intervals[0]]
+
+    for current in intervals[1:]:
+        # Get the last interval in the merged list
+        last = merged[-1]
+
+        # If the current interval overlaps with the last one and they are on the same chromosome and strand,
+        # merge them by extending the end of the last interval to the end of the current one
+        if current[0] == last[0] and current[5] == last[5] and current[1] <= last[2]:
+            last = list(last)
+            last[2] = max(last[2], current[2])
+            merged[-1] = tuple(last)
+        else:
+            # Otherwise, just add the current interval to the list
+            merged.append(current)
+
+    return merged
+
+
 def make_bed_lines(proj_to_reg):
     """Create bed9 lines."""
-    bed_lines = []
+    intervals = []
     for name, region in proj_to_reg.items():
         # just parse the region
         chrom = region[0]
@@ -144,15 +168,21 @@ def make_bed_lines(proj_to_reg):
             chrom,
             chrom_start,
             chrom_end,
-            name,
+            f"{name}-like",
             DEF_SCORE,
             strand,
             thick_start,
             thick_end,
             PINK_COLOR,
         )
-        line = "\t".join(str(x) for x in line_data)
-        bed_lines.append(line)
+        intervals.append(line_data)
+
+    # Merge overlapping intervals
+    intervals = merge_intervals(intervals)
+
+    # Convert the merged intervals back into bed lines
+    bed_lines = ["\t".join(str(x) for x in interval) for interval in intervals]
+
     return bed_lines
 
 
@@ -173,6 +203,10 @@ def create_ppgene_track(chain_class_file, chain_file, bed_bdb, output, v=None):
     chain_dict = load_chain_dict(index_file)
     verbose("Extracting pr pseudogene chains") if v else None
     gene_to_pp_chains = get_pp_gene_chains(chain_class_file, v)
+    if len(gene_to_pp_chains) == 0:
+        # no proc pseudogenes
+        verbose("No processed pseudogenes found, skip.")
+        return
     # for each gene-chain pair get corresponding region in query
     verbose("Extracting corresponding regions") if v else None
     projection_to_reg = get_corr_q_regions(

diff --git a/requirements.txt b/requirements.txt
@@ -1,8 +1,8 @@
-twobitreader
-networkx
-pandas
-numpy
-xgboost
-scikit-learn
-joblib
-h5py
+twobitreader==3.1.7
+networkx==3.1
+pandas==2.0.2
+numpy==1.24.3
+xgboost==1.7.5
+scikit-learn==1.2.2
+joblib==1.2.0
+h5py==3.8.0
diff --git a/run_test.sh b/run_test.sh
@@ -7,18 +7,14 @@ then
     exit 0
 fi
 
-eval ./configure.sh
-if [[ $? -ne 0 ]]
-then
-    echo "Configure + build failed"
-    exit 1
-else
-    echo "Configure successful!"
-fi
-
-test_out="test_out"
-eval rm -rf ${test_out}
-eval mkdir -p ${test_out}
+ eval ./configure.sh
+ if [[ $? -ne 0 ]]
+ then
+     echo "Configure + build failed"
+     exit 1
+ else
+     echo "Configure successful!"
+ fi
 
 if [ $1 = "micro" ];
 then
@@ -36,6 +32,7 @@ then
 elif [ $1 = "normal" ];
 then
     echo "Running widescale test"
+    eval "rm -rf test_out"
 
     hg38_2bit=$2
     mm10_2bit=$3
@@ -49,7 +46,7 @@ then
 
     cmd="./toga.py  ${test_chain} ${test_bed} \
          ${hg38_2bit} ${mm10_2bit} \
-         --chn 10 --project_name ${test_out} \
+         --chn 10 --project_name test_out \
          --kt -i supply/hg38.wgEncodeGencodeCompV34.isoforms.txt \
          --cjn 200 --ms --cb 2,4 \
          "