Skip to content

Commit

Permalink
Version 1.1.3
Browse files Browse the repository at this point in the history
  • Loading branch information
kirilenkobm committed Jun 30, 2023
1 parent 9882fde commit bd6ea2b
Show file tree
Hide file tree
Showing 9 changed files with 159 additions and 73 deletions.
10 changes: 6 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

![TOGA logo](https://github.com/hillerlab/TOGA/blob/master/supply/logo.png)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
![version](https://img.shields.io/badge/version-1.1.0-blue)
![version](https://img.shields.io/badge/version-1.1.3-blue)
[![DOI](https://zenodo.org/badge/277817661.svg)](https://zenodo.org/badge/latestdoi/277817661)

TOGA is a new method that integrates gene annotation, inferring orthologs and classifying
Expand All @@ -14,12 +14,14 @@ related species and to accurately distinguish orthologs from paralogs or process
This tutorial explains how to get started using TOGA.
It shows how to install and execute TOGA, and how to handle possible issues that may occur.

[TOGA changelog](https://github.com/hillerlab/TOGA/blob/master/VersionHistory.md)
For more details, please check out the [TOGA wiki](https://github.com/hillerlab/TOGA/wiki).

[Changelog](https://github.com/hillerlab/TOGA/blob/master/VersionHistory.md).

## Installation

TOGA supports both Linux and MacOS systems.
The package was properly tested on Python version 3.6.5. and 3.7.3.
TOGA is compatible with Linux and MacOS, including M1-based systems.
It has been tested and verified to work with Python versions 3.9 and higher.

It is highly recommended to have access to computational cluster, but
for small or partial genomes with short genes a desktop PC will be enough.
Expand Down
34 changes: 32 additions & 2 deletions VersionHistory.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ This was used to produce all data in the original manuscript:
Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos DG, Hilgers L, Lindblad-Toh K, Karlsson EK, Zoonomia Consortium, Hiller M.
[Integrating gene annotation with orthology inference at scale](https://www.biorxiv.org/content/10.1101/2022.09.08.507143v1). _Science_, in press, 2022

# TOGA 1.1 #
# TOGA 1.1.0 #


### Improved transcript classification ###
Expand All @@ -14,6 +14,9 @@ In TOGA 1.1, we now consider the region between compensating frameshifts as dele
* Improved classification of N-terminal mutations. In TOGA 1.0, we masked inactivating mutations if they are located in the first 10% of the reading frame. The reason was that often there is a downstream ATG in proximity, but this is not always the case.
In TOGA 1.1, we now determine where the next inframe ATG codon is located that is downstream of a potential inactivating mutation. If this ATG is located in the first 10% of the reading frame, the mutation is not considered as inactivating (as in 1.0). However, if this ATG is located outside the first 10% of the reading frame, a truncated protein would lack at least 10% of the N-terminus and TOGA 1.1 now considers this mutation as inactivating in the transcript classification step.

* Improved classification of borderline cases where a projection has considerable amount of both deleted and missing regions.

* Fixed intron retention module bug. Now, TOGA should correctly identify them.

### Codon alignments ###
* The codonAlignment.fasta file previously contained the raw CESAR output, including exons that were classified as deleted or missing in later processing steps. TOGA 1.1 now generates a codonAlignment.fasta where such exons are excluded.
Expand All @@ -26,5 +29,32 @@ In TOGA 1.1, we now determine where the next inframe ATG codon is located that i
### Minor changes ###
* Longer lists of several compensating frameshifts (e.g. FS_1,2,3,4,5) could result in warnings when loading SQL tables if the string gets too long. To avoid this, we now write a condensed format FS_1-5.
* Avoiding overlapping exon annotations in fragmented query assemblies. If a gene is split into several query chains, there were rare cases where several query chains overlap in the query genome, resulting in exon annotations that overlap each other. This creates invalid bed or genePred records. In TOGA 1.1, we now classify such genes as missing and do not output a bed entry for this gene.
* Sometimes a gene had several orthologous chains and one of them is a 'runaway' chain that spans megabases in the query locus, which is not computable in the CESAR step. TOGA 1.1 now ignores the runaway chain but runs CESAR for the other (normal) orthologous chains. This adds a few orthologs to the annotation.
* Sometimes a gene had several orthologous chains and one of them is a 'runaway' chain that spans megabases in the query locus, which is not computable in the CESAR step. TOGA 1.1 now ignores the runaway chain but runs CESAR for the other (normal) orthologous chains. This adds a few orthologs to the annotation.
* Added TOGA_assemblyStats.py
* Bug fixes in merge_cesar_output.py and CESAR output parser.
* Minor changes in plot_mutations.py.


### Nomenculature updates ###
* `MIDDLE_80%_INTACT` projection's feature renamed to `MIDDLE_IS_PRESENT`
* `SSM` mutation type (splice site mutation) is split into more precise `SSMD` (donor) and `SSMA` (acceptor)


# TOGA 1.1.1 #

### Critical bugfixes ###

* Fixed codon alignment generation procedure - all codons are 3bp long. Previous version could mistakenly include dinucleoties into the codon alignment.
* Fixed parsing the protein and codon alignment output. Previous version could add a sequence to protein alignment if its ID contained "PROT".

# TOGA 1.1.2 #

Documentation improvements.

# TOGA 1.1.3 #

* Added support for Nextflow DSL2.
* Enabled generation of the processed pseudogenes annotation track.
* Removed `-f/--fragmented_genome` flag from toga.py CLI - the "assemble transcript from fragments" feature is enabled by default. Please use `--disable_fragments_joining/--dfj` option to disable this feature.
* Fixed package versions in the `requirements.txt`

22 changes: 17 additions & 5 deletions configure.sh
Original file line number Diff line number Diff line change
Expand Up @@ -2,16 +2,28 @@
exit_status=0
mydir="${0%/*}"

OVERRIDE=false
if [ "$1" == "--override" ]; then
OVERRIDE=true
echo "Overriding existing CESAR installation and models"
fi

printf "Compiling C code...\n"
CFLAGS="-Wall -Wextra -O2 -g -std=c99"

gcc $CFLAGS -o ${mydir}/modules/chain_score_filter ${mydir}/modules/chain_score_filter.c
gcc $CFLAGS -o ${mydir}/modules/chain_filter_by_id ${mydir}/modules/chain_filter_by_id.c
# Check machine architecture and set appropriate flags
if [ $(uname -m) = "arm64" ]; then
CFLAGS="-Wall -Wextra -O2 -g -std=c99" # adjust flags for M1 if necessary
else
CFLAGS="-Wall -Wextra -O2 -g -std=c99" # original flags for x86
fi

gcc $CFLAGS -o ${mydir}/modules/chain_score_filter ${mydir}/modules/chain_score_filter.c
gcc $CFLAGS -o ${mydir}/modules/chain_filter_by_id ${mydir}/modules/chain_filter_by_id.c
gcc $CFLAGS -fPIC -shared -o ${mydir}/modules/chain_coords_converter_slib.so ${mydir}/modules/chain_coords_converter_slib.c
gcc $CFLAGS -fPIC -shared -o ${mydir}/modules/extract_subchain_slib.so ${mydir}/modules/extract_subchain_slib.c
gcc $CFLAGS -fPIC -shared -o ${mydir}/modules/chain_bst_lib.so ${mydir}/modules/chain_bst_lib.c

if [[ -f "./models/se_model.dat" ]] || [[ -f "./models/me_model.dat" ]]
if ! $OVERRIDE && { [[ -f "./models/se_model.dat" ]] || [[ -f "./models/me_model.dat" ]]; }
then
printf "Model found\n";
else
Expand All @@ -20,7 +32,7 @@ else
printf "Model created\n"
fi

if [[ -f "./CESAR2.0/cesar" ]]
if ! $OVERRIDE && [[ -f "./CESAR2.0/cesar" ]]
then
printf "CESAR installation found\n"
else
Expand Down
26 changes: 18 additions & 8 deletions execute_joblist.nf
Original file line number Diff line number Diff line change
@@ -1,5 +1,8 @@
#!/usr/bin/env nextflow
// Author: Bogdan Kirilenko, 2020

nextflow.enable.dsl=2

// Author: Bogdan Kirilenko, 2023
// Nextflow procedure to execute chain feature extraction jobs
// Joblist contains a file where each line is a separate command
// We just call these lines in parallel
Expand All @@ -9,22 +12,29 @@ params.joblist = 'NONE' // file containing jobs

// if still default -> nothing assigned: show usage message and quit
if (params.joblist == "NONE"){
printf("Usage: nextflow execute_joblist.nf --joblist [joblist file] -c [config file]\n")
println("Usage: nextflow execute_joblist.nf --joblist [joblist file] -c [config file]")
System.exit(2);
}
}

// create channel lines -> we need to execute lines in parallel
joblist = file(params.joblist)
lines = Channel.from(joblist.readLines())
lines = Channel.fromPath(params.joblist).splitText()

process execute_jobs {

// allow each process to fail 3 times
errorStrategy 'retry'
maxRetries 3

input:
val line from lines
val line

// one line represents an independent command
script:
"""
${line}
"""
}

"${line}" // one line represents an independent command
workflow {
execute_jobs(lines)
}
6 changes: 3 additions & 3 deletions modules/gene_losses_summary.py
Original file line number Diff line number Diff line change
Expand Up @@ -309,11 +309,11 @@ def get_projection_classes(
print(f"Threshold is {REM_T_L} codons; decide between L and M")
# print(f"-> class L")
if frame_oub > 0.65:
print(f"Out of chain prob > 0.65: {frame_oub}")
print(f"-> class M")
# print(f"Out of chain prob > 0.65: {frame_oub}")
# print(f"-> class M")
projection_class[projection] = M
else:
print("-> class L")
# print("-> class L")
projection_class[projection] = L
continue

Expand Down
42 changes: 38 additions & 4 deletions modules/make_pr_pseudogenes_anno.py
Original file line number Diff line number Diff line change
Expand Up @@ -127,9 +127,33 @@ def get_corr_q_regions(gene_to_pp_chains, chain_file, chain_dict, bed_bdb, v=Fal
return proj_to_q_reg


def merge_intervals(intervals):
# Sort the intervals by start position
intervals.sort(key=lambda x: (x[0], x[1], x[2]))

# Initialize the merged list with the first interval
merged = [intervals[0]]

for current in intervals[1:]:
# Get the last interval in the merged list
last = merged[-1]

# If the current interval overlaps with the last one and they are on the same chromosome and strand,
# merge them by extending the end of the last interval to the end of the current one
if current[0] == last[0] and current[5] == last[5] and current[1] <= last[2]:
last = list(last)
last[2] = max(last[2], current[2])
merged[-1] = tuple(last)
else:
# Otherwise, just add the current interval to the list
merged.append(current)

return merged


def make_bed_lines(proj_to_reg):
"""Create bed9 lines."""
bed_lines = []
intervals = []
for name, region in proj_to_reg.items():
# just parse the region
chrom = region[0]
Expand All @@ -144,15 +168,21 @@ def make_bed_lines(proj_to_reg):
chrom,
chrom_start,
chrom_end,
name,
f"{name}-like",
DEF_SCORE,
strand,
thick_start,
thick_end,
PINK_COLOR,
)
line = "\t".join(str(x) for x in line_data)
bed_lines.append(line)
intervals.append(line_data)

# Merge overlapping intervals
intervals = merge_intervals(intervals)

# Convert the merged intervals back into bed lines
bed_lines = ["\t".join(str(x) for x in interval) for interval in intervals]

return bed_lines


Expand All @@ -173,6 +203,10 @@ def create_ppgene_track(chain_class_file, chain_file, bed_bdb, output, v=None):
chain_dict = load_chain_dict(index_file)
verbose("Extracting pr pseudogene chains") if v else None
gene_to_pp_chains = get_pp_gene_chains(chain_class_file, v)
if len(gene_to_pp_chains) == 0:
# no proc pseudogenes
verbose("No processed pseudogenes found, skip.")
return
# for each gene-chain pair get corresponding region in query
verbose("Extracting corresponding regions") if v else None
projection_to_reg = get_corr_q_regions(
Expand Down
16 changes: 8 additions & 8 deletions requirements.txt
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
twobitreader
networkx
pandas
numpy
xgboost
scikit-learn
joblib
h5py
twobitreader==3.1.7
networkx==3.1
pandas==2.0.2
numpy==1.24.3
xgboost==1.7.5
scikit-learn==1.2.2
joblib==1.2.0
h5py==3.8.0
23 changes: 10 additions & 13 deletions run_test.sh
Original file line number Diff line number Diff line change
Expand Up @@ -7,18 +7,14 @@ then
exit 0
fi

eval ./configure.sh
if [[ $? -ne 0 ]]
then
echo "Configure + build failed"
exit 1
else
echo "Configure successful!"
fi

test_out="test_out"
eval rm -rf ${test_out}
eval mkdir -p ${test_out}
eval ./configure.sh
if [[ $? -ne 0 ]]
then
echo "Configure + build failed"
exit 1
else
echo "Configure successful!"
fi

if [ $1 = "micro" ];
then
Expand All @@ -36,6 +32,7 @@ then
elif [ $1 = "normal" ];
then
echo "Running widescale test"
eval "rm -rf test_out"

hg38_2bit=$2
mm10_2bit=$3
Expand All @@ -49,7 +46,7 @@ then

cmd="./toga.py ${test_chain} ${test_bed} \
${hg38_2bit} ${mm10_2bit} \
--chn 10 --project_name ${test_out} \
--chn 10 --project_name test_out \
--kt -i supply/hg38.wgEncodeGencodeCompV34.isoforms.txt \
--cjn 200 --ms --cb 2,4 \
"
Expand Down
Loading

0 comments on commit bd6ea2b

Please sign in to comment.