This repository contains the code for ORCA as described in our application note, Predicting replication origins in circular prokaryotic chromosomes. This repository contains the package ORCApy which is a collection of scripts that predict and plot the location of the origin of replication (oriC) of circular bacterial genomes based on Z-curve, GC-skew, dnaA-box, and gene location analyses. This README will not explain all of ORCA's methods. All functions, main or helper, are labelled with docs-strings and type-hinting for ease of use. Most functions in ORCA can be used separately as well. For example, calculate_disparity_curves
can be used outside of ORCA, too. As well as all functions for fetching files from NCBI, or plotter functions, etc. Please, see their provided docs-strings for more info.
To install ORCApy, simply use:
pip install orcapy
Or download it directly from the PyPi website here. The installation and download only include the files present in the src
folder.
To download the provided Random Forest Classifier, simply use:
wget https://github.com/ZoyavanMeel/ORCA/blob/main/data/output/machine_learning/ORCA_RFC_model_1_4_0.pkl.gz
Or use a similar method, like curl
. You can also download it manually by going here, clicking on the three horizontal dots in the top right and selecting 'Download'. See the section Data Output for more provided models.
This script predicts the origin of replication for circular bacterial DNA. It makes use of a combination of Z-curve and GC-skew, dnaA-box, and gene location analyses. You can load the required FASTA files yourself, or simply provide an accession and NCBI-account email and ORCA will fetch them. The docs-string of the function shows more information on what is needed to use ORCA. See the Example Use section for more information. NOTE: the model provided in this repository has been compressed due to file size limitations. This example also uses the Joblib package for unpickling the model. This is not necessary, and Python's standard pickle and gzip package can be used as well.
Please, make sure to load the proper file formats into the functions, otherwise ORCA will not work. A lot of invalid parameters will throw warnings or errors, but it is not improbable that a few were missed. More of ORCA's functionality and input methods are discussed in the paper.
There are four valid input formats for ORCA. Provided one has an internet connection, ORCA can function using only an accession number (from_accession
). When using only an accession, ORCA will fetch the corresponding GenBank file from NCBI using Biopython's Entrez module. Please adhere to NCBI's rules on making use of their API. This module functions as a Python interface for NCBI's Entrez Direct. The version number of the accession can be specified. If this is omitted, the most recent version is automatically chosen. This method is overall the slowest and takes about one minute to download and analyse for a chromosome of 5Mbp. One can also provide the GenBank file themselves (from_gbk
). This file will be processed the same way, but does not require an internet connection.
Another option is to provide Biopython SeqRecord objects (from_pkl
). ORCA uses Biopython's SeqIO module for parsing the GenBank file. This means that input from GenBank will go through this module anyway and uses SeqRecords. Providing a SeqRecord that was made by parsing a Genbank file using this same module is therefore also possible. Pickled SeqRecord objects take up more memory, but they can be parsed quicker. This is a consideration that can be taken into account when processing large genomic datasets.
Lastly, one can also simply provide a DNA-sequence string (from_string
). This same input method allows for indicating the location of the indicator genes. These gene locations are not inferred from the provided sequence and will have to be provided. This is the fastest input format, but takes the most work from the user. All input parameters are assumed to be in the correct specified format and will not be processed using Biopython. Further explanation on each of these formats can also be found in the documentation of the code.
With the RFC, we provide a general use-case oriC prediction tool as outlined in the application note. However, as shown in the code documentation, there are many parameters that can be fine-tuned and possibly improved. One of the reasons of ORCA being open-source, is to provide not just a transparent oriC-prediction tool, but also a building block for further research.
It is possible to tune all parameters and train models for highly accurate prediction of the oriCs of bacterial species of interest. This could be done by incorporating more indicator genes for that species, or changing any number of other parameters. It could also be possible to adapt ORCA for the use in the prediction of oriCs in linear prokaryotic chromosomes, or archaea. it is our hope that ORCA can help research into any or all of these avenues with its adaptability or provide a lightweight easy-to-use tool with good out-of-the-box performance.
>>> import joblib
>>> from orcapy import ORCA
>>> email = "example@email.com"
>>> model = joblib.load("path/to/model.pkl.gz")
>>> orca = ORCA.from_accession("NC_000913.3", email=email, model=model)
>>> orca.find_oriCs(show_info=True, plot_path=None)
Accession : NC_000913
predictions : 0.99412
Z_scores : 1.0
G_scores : 0.5
D_scores : 1.0
oriC_middles : 3927479
The best-scoring potential oriC was found at: 3927479 bp.
If you do not wish to use joblib, you can also open the model using:
import gzip, pickle
with gzip.open("path/to/model.pkl.gz", "rb") as fh:
model = pickle.load(fh)
In the case of Escherichia coli (E. coli) K-12, only one potential origin was found and according to the model, this could also be classified as a true origin. Any prediction value >= 0.5, means that the model believes there is a 50 % or more that the corresponding origin is a true origin. It is possible that multiple candidate origins have a probability of being a true origin that is larger than 50 %. Then simply use the origin corresponding to the highest probability, as this was the origin that the model deemed most likely to be correct.
This repository also includes a pickled SeqRecord
of the E. coli K-12 chromosome (here, download like the RFC model). If one wanted to use that, only a different constructor would have to be used.
>>> from orcapy import ORCA
>>> orca = ORCA.from_pkl("data/input/NC_000913_3.pkl", model=model)
>>> orca.find_oriCs(show_info=False, plot_path="path/to/plot.png")
There are four constructors in total: from_accession
, from_gbk
, from_pkl
, from_string
. Each comes with extensive documentation of how to use them. Once the ORCA
object has been instantiated, call find_oriCs
to analyse the sequence.
The parameters listed below are parameters that can be used as they are or can be fine-tuned for specific use cases. The standard parameters reflect ORCA's performance as shown in the application note. Retraining of the Random Forest Classifier is recommended if any parameters are changed. Otherwise, the same performance as is the paper can not be guaranteed.
dnaa_boxes
: If None, will use the consensus DnaA-box:TTAT(A|C|G|T)CACA
(see Section DnaA). Else, provide a list of 9 base strings. Example input:['AAAAAAAAA', 'TTTTTTTTT']
.max_mismatches
: Maximum allowed mismatches allowed in a dnaa_box for it still to be read as such. Recommended max: 2. ORCA uses 0 for use with the consensus DnaA-box.genes_of_interest
: List of gene names to consider as 'oriC-proximal' and use for helping estimate the location of the oriC. This parameter is case insensitive.max_point_spread
: Maximum distance between points in a group can have when looking for connected groups across the disparity curves. Default is 5 % of the total chromosome length.windows
: The windows around peaks of skew curves to consider. Defaults are 1, 3, and 5 % of the total chromosome length. ORCA checks each of the given windows.model
: A fitted scikit-learn classifier. Recommended to use the one provided on in this repository.
- Python 3.11.1
- SciPy 1.10.0
- Pandas 1.5.3
- NumPy 1.24.2
- Biopython 1.80
- Scikit-learn 1.4.1
- Matplotlib 3.6.3
The Peak
class. This class is used in handling potential oriCs. The oriCs
-attribute of an ORCA
object consists of a list of Peak
objects. Each Peak
represents a potential oriC and has attributes for its Z-, G-, and D-score as well as the confidence a potential machine learning model has.
Script with useful functions for I/O. These include functions for downloading, parsing, and saving files.
save_gbk
: Fetches and saves the Genbank file of the given accession in the given path.save_pkl
: Fetches and saves the given accession in the given path as a pickledSeqRecord
generated from parsing a Genbank file with Biopython.fetch_file
: Fetches the given accession in the provided file format from NCBI. Will return aTextIOWrapper
to be used.parse_SeqRecord
: Extracts the locations of the provided genes of interest into a dictionary.
To use these function call:
from orcapy import BioFile
There are 3 general functions in this file which can be used to plot any generic 1D-np.array. To use these functions, make sure to have matplotlib installed.
plot_Z_curve_3D
: Makes a 3D-plot of the Z-curve.plot_curves
: Can plot a maximum of four curves in a single 2D-plot. This one is useful for plotting single or multiple Z-curve or GC-skew components against each other.plot_skew
: Does the same asplot_curves
, except only takes one array.
To use these functions call:
from orcapy import Plotter
This folder contains all of the code used in analysing, training, and testing ORCA and its Random Forest Classifier. Each script comes with docs-string explaining their functionality. A copy of this folder is available in Zenodo, at https://dx.doi.org/10.5281/zenodo.10888686.
This folder contains all the input data. This includes the data from DoriC 12.0. The DoriC data were downloaded from: http://tubic.tju.edu.cn/doric/public/index.php on May 16th, 2023, but have also been preserved in Zenodo, at https://dx.doi.org/10.5281/zenodo.10888580. This folder also includes the experimental_dataset
CSV. This is a collection of oriCs that have been experimentally verified. It also includes sources for each chromosome. This dataset was made in order to check the quality of DoriC.
This folder also includes a pickled SeqRecord
object of the E. coli K-12 chromosome. This file was used for quick testing and demonstration.
This folder contains output from the various performance test we ran on ORCA. The machine_learning sub-folder contains the provided Random Forest Classifiers. Use the gzip and pickle from the standard Python library to load the models or, alternatively, use joblib. Please be mindful of your Scikit-learn version, as models are not always backwards-compatible.
ORCA_RFC_model_1_4_0.pkl.gz
: This model has been trained on the full DoriC 12.0 dataset. 1_4_0 refers to the scikit-learn version that was used to train the model (1.4.0). This repository also contains a fully trained model on scikit-learn 1.2.1.ORCA_RFC_model_70_percent.pkl.gz
: This model has been trained on roughly 70 % of the DoriC 12.0 dataset. First all the experimentally verified origins were subtracted and the remaining dataset was split 70:30 stratified. This model is has been included for posterity. Use the other model for any analyses. This model was trained on scikit-learn version 1.2.1.
The table below shows a small overview of common dnaA-boxes and relevant papers associated with them. This table is by no means complete, but can serve as a starting point for further research. We also include some more papers that were useful in researching the effects of the dnaA protein and its binding sites: [1], [2], [3], [4].
We use the first entry in the table below and allow for 0 mismatches in this sequence.
Sequence | Paper | Year | Notes |
---|---|---|---|
TTAT(A|C|G|T)CACA | [5], [6] | 2007/8 | In (roughly) all eubacteria, not just B. subtilis or E. coli |
TTATCCACA, TTTTCCACA, TTATCAACA, TCATTCACA, TTATACACA, TTATCCAAA | [7] | 1997 | Affinity study |
(T|C)(T|C)(A|T|C)T(A|C)C(A|G)(A|C|T)(A|C) | [8] | 1991 | Only in E. coli K12. Do not use. |
TGTG(G|T)ATAAC | [9] | 1985 | Matsui-box |
TTAT(A|C)CA(A|C)A | [10] | 1984 | The first consensus |