Cross-Beta predictor

Welcome to the Cross-Beta predictor, an Extratrees-based predictor trained to identify naturally occurring cross-beta-forming amyloids. While this algorithm has been trained against no-forming disordered protein regions, it also has shown very good results in making the difference between regular proteins and amyloids.

This is a version usable with command lines and uses multiprocessing. Below is a description of all the arguments you can use to run it as well as the different conditions to make it work.

Most used packages are already given when installing Python, however, some of them might need to be updated or installed.

An online version is also provided at https://bioinfo.crbm.cnrs.fr/index.php?route=tools&tool=35.

Table of content

Libraries and dependancies
Files
1. CB_RF_pred.py, the main file
2. utils folder
Arguments and how to use the Cross-Beta predictor with command lines
1. Options
2. Usage
Example of utilization
Contact or cite us

1. Libraries and dependancies

Cross-Beta predictor has been tested for python 3.8 and python 3.12

pickle (v 0.0.12) (installed by default since python 3.7)
numpy (v 1.24.2)
pandas (v 2.0.0)
matplotlib (v 3.7.1)
scipy (1.8.0)
scikit-learn (1.3.1) Using a different version may invalidate your results, as the model used was built in version 1.3.1.

This work also uses a protein disorder predictor IUPred3 which you can download by requesting their website.

Note that IUPred3 is under academic license.

Once you have downloaded IUPRed3, you can put it in the utils folder as:

- Cross-Beta_predictor
|
| - utils
|  |
|  | - iupred3 <------
|  |  |
|  |  | ...
|  |
|  | check_install.py
|  | fold_pred.py
|  | ...
|
| ...

2. Files

i. `CB_RF_pred.py`, the main file

CB_RF_pred.py: The main Python script. Will parse the given arguments by the command line, verify them and use them to run the main function: Cross_Beta_RF_pred(). In this function, you will find the extraction of the given sequences and label them to run the prediction function: run_prediction_fasta_longseq(). As some used algorithms only work with the 20 essential amino acids code, a regex filter r"^[ARNDCQEGHILKMFPSTWYV]*$"is applied on all the given sequences. All the sequences that don't match this expression won't be predicted

ii. `utils` folder

utils.py: the file where you will find the sequence extraction function as well as the function for drawing graphs and saving data in .csv format. You will also find a function to detect if the script must run on a folder or on a file and a function to check if a given argument has a None value.
fold_pred.py: the file handling all the disorder predictions by using IUpred3.
get_features.py: the file where all function generating and extracting features from amino acid sequences are.
progress_bar.py: the script where functions are about generating a progress bar and computing the estimated remaining time.

3. Arguments and how to use the Cross-Beta predictor with command lines

i. Options

-h, --help: show this help message and exit

-ci, --check_install: Check if all folders and files are correctly named and in the right place. Check if all dependencies are importable

-i INPUT, --input INPUT: Input file path, can be file or folder or an amino acid sequence

-it INPUT_TYPE, --input_type INPUT_TYPE: Define the input type (sequence, CSV or fasta) (default: 'fasta').

-nc NAME_COL, --name_col NAME_COL: The name of the column containing the sequence ids. Only needed if the input type is CSV.

-sc SEQ_COL, --seq_col SEQ_COL: The name of the column containing the sequences. Only needed if the input type is CSV.

-t THRESHOLD, --threshold THRESHOLD: Classification threshold. Must be between 0 and 1 (Default: 0.54)

-g DRAW_GRAPH, --draw_graph DRAW_GRAPH: Active or not the creation of graph for each result in the given output path

-o OUTPUT, --output OUTPUT: Output file name, the output results will always be saved in the 'results/' folder (default: prediction_result.csv)

ii. Usage

3 types of inputs are handled: path to a folder, path to a file, or an amino acid sequence. To make the program work for all those cases, you must specify the type of the given input: sequence, CSV, or fasta. From this, if the input is a folder, the program will run on all the CSV or fasta present in it.

If the type is CSV, you must enter the 2 additional arguments -nc (the name of the column where to find a sequence ID or equivalent) and -sc (the name of the column where to find the sequences to run the prediction on).

Then you can choose a threshold value going from 0 to 1 for the prediction. All results strictly superior to the threshold will be considered amyloidogenic. The default and recommended threshold is set to 0.54.

Two outputs are generated, the CSV result file name is determined by the -o argument (default set to prediction_result.csv). It contains the ID of the sequence, the length, the global prediction score, the amyloid regions (of minimum 15 amino acids with a score > threshold) and the detailed score for all the amino acids of the sequence. The other output optionally generated is the graphic view of the amyloidogenicity score for all the amino acids of the sequence. The name of the graphic is based on the id of the corresponding sequence. This option can be activated or not by entering True or False to the -g argument (Default set to False).

4. Example of utilisation

i. Check the installation of Cross-Beta predictor

python3  CB_RF_pred.py  \
-ci

ii. Run the program with a single protein sequence

python3  CB_RF_pred.py  \
-i  MASNDYTQQATQSYGAYPTQPGQGYSQQSSQPYGQQSYSGYSQSTDTSGYGQSSYSSYGQSQNTGYGTQSTPQGYGSTGGYGSSQSSQSSYGQQSSYPGYGQQPAPSSTSGSYGSSSQSSSYGQPQSGSYSQQPSYGGQQQSYGQQQSYNPPQGYGQQNQYNSSSGGGGGGGGGGNYGQDQSSMSSGGGSGGGYGNQDQSGGGGSGGYGQQDRGGRGRGGSGGGGGGGGGGYNRSSGGYEPRGRGGGRGGRGGMGGSDRGGFNKFGGPRDQGSRHDSEQDNSDNNTIFVQGLGENVTIESVADYFKQIGIIKTNKKTGQPMINLYTDRETGKLKGEATVSFDDPPSAKAAIDWFDGKEFSGNPIKVSFATRRADFNRGGGNGRGGRGRGGPMGRGGYGGGGSGGGGRGGFPSGGGGGGGQQRAGDWKCPNPTCENMNFSWRNECNQCKAPKPDGPGGGPGGSHMGGNYGDDRRGGRGGYDRGGYRGRGGDRGGFRGGRGGGDRGGFGPGKMDSRGEHRQDRRERPY  \
-it  sequence  \
-g

The given input is an amino acid sequence and a result graphic will be drawn. All results will be saved in the /results/ directory in the Cross-Beta_RF_pred folder.

iii. Run the program with a CSV file

python3  CB_RF_pred.py  \
-i  example/Test_csv.csv  \
-it  'csv'  \
-nc  'Label'  \
-sc  'Sequence'

Run the prediction on the CSV available in /example/Test_csv.csv without drawing the resulting graph. For a CSV, label and sequence column names must be specified using -nc and -sc.

iv. Run the program with a fasta file (work also for multifasta)

python3  CB_RF_pred.py  \
-i  example/Test_fasta.fasta  \
-it  'fasta'  \
-g  \
-o  fasta_result.csv

Run the prediction on the fasta file available in /example/Test_fasta.fasta, create graphic results and store the CSV result file as 'fasta_result.csv'

5. Contact or cite us

Author

Valentin Gonay

GitHub

For any issue with the program, you can address a mail with [Cross-Beta RF] in subject to valentin.gonay@crbm.cnrs.fr

Citing Cross-Beta predictor

Gonay V, Dunne MP, Caceres-Delpiano J, Kajava AV. Developing machine-learning-based amyloidogenicity predictors with Cross-Beta DB. Alzheimer's Dement. 2025; 1-7. https://doi.org/10.1002/alz.14510

Citing IUPred3

Gábor Erdős, Mátyás Pajkos, Zsuzsanna Dosztányi IUPred3 - improved prediction of protein disorder with a focus on specific user applications Nucleic Acids Research 2021, Submitted Bálint Mészáros, Gábor Erdős, Zsuzsanna Dosztányi (2018) IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Research;46(W1):W329-W337.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
data		data
example		example
utils		utils
CB_RF_pred.py		CB_RF_pred.py
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cross-Beta predictor

Table of content

1. Libraries and dependancies

2. Files

i. `CB_RF_pred.py`, the main file

ii. `utils` folder

3. Arguments and how to use the Cross-Beta predictor with command lines

i. Options

ii. Usage

4. Example of utilisation

i. Check the installation of Cross-Beta predictor

ii. Run the program with a single protein sequence

iii. Run the program with a CSV file

iv. Run the program with a fasta file (work also for multifasta)

5. Contact or cite us

Author

Citing Cross-Beta predictor

Citing IUPred3

About

Languages

License

Valentin-Gonay/cross-beta-predictor

Folders and files

Latest commit

History

Repository files navigation

Cross-Beta predictor

Table of content

1. Libraries and dependancies

2. Files

i. CB_RF_pred.py, the main file

ii. utils folder

3. Arguments and how to use the Cross-Beta predictor with command lines

i. Options

ii. Usage

4. Example of utilisation

i. Check the installation of Cross-Beta predictor

ii. Run the program with a single protein sequence

iii. Run the program with a CSV file

iv. Run the program with a fasta file (work also for multifasta)

5. Contact or cite us

Author

Citing Cross-Beta predictor

Citing IUPred3

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

i. `CB_RF_pred.py`, the main file

ii. `utils` folder