KinActive is a tool for working with protein kinase (PK) sequences and structures.
It's intended for two purposes:
- Assembling, managing, and describing a collection of protein kinase sequences and structures.
- Using structure- and sequence-based ML models to annotate protein kinase sequences and structures.
The package can be installed via pip:
pip install kinactive
or directly from this repository:
pip install git+https://github.com/edikedik/kinactive.git
Using a virtual environment (e.g., conda) is recommended.
The data and ML models are described in detail in the following papers (currently under review):
- Classifying protein kinase conformations with machine learning. Ivan Reveguk & Thomas Simonson.
- Uncovering DFG-out sequence propensity determinants of kinases with machine learning. Ivan Reveguk & Thomas Simonson.
In short, the data collection encompasses PK domain PDB structures nested under (and
mapped to) the canonical UniProt sequences. Each domain sequence is also mapped to a
single reference profile (PF00069 from Pfam). The collection is prepared using
lXtractor, so any updates to the data are tied to lXtractor updates and improvements.
All the ML models here are XGBoost binary classifiers. There are two kinds of models:
those annotating PK domain sequences and those annotating PK domain structures. The
sequence-based models predict a given sequence's propensity to adopt the DFG-in or
DFG-out orientation, in the apo or holo state, separately for tyrosine kinases (TKs)
and serine/threonine kinases (STKs). To facilitate the assignment to the TK and STK
groups, there is an additional sequence-based model, TkST, that outputs 0 for STK
domains and 1 for TKs.
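The routing described above can be sketched in a few lines of Python. This is an illustration only: the function names and the model keys (such as "TK_apo") are hypothetical and are not part of the kinactive API. The only facts taken from the text are that TkST outputs 0 for STK domains and 1 for TKs, and that separate models exist for each group and for the apo and holo states.

```python
from typing import Literal

Group = Literal["TK", "STK"]
State = Literal["apo", "holo"]


def kinase_group(tkst_label: int) -> Group:
    """Translate a TkST output into a group name: 0 means STK, 1 means TK."""
    if tkst_label not in (0, 1):
        raise ValueError(f"expected 0 or 1, got {tkst_label!r}")
    return "TK" if tkst_label == 1 else "STK"


def select_dfg_model(tkst_label: int, state: State) -> str:
    """Return a hypothetical key naming the DFG propensity model to apply.

    The key format "<group>_<state>" is an assumption made for this sketch.
    """
    if state not in ("apo", "holo"):
        raise ValueError(f"expected 'apo' or 'holo', got {state!r}")
    return f"{kinase_group(tkst_label)}_{state}"


# A domain labelled 1 by TkST is routed to the tyrosine kinase models.
print(select_dfg_model(1, "apo"))
print(select_dfg_model(0, "holo"))
```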
The structure-based models annotate PK domain structures as active/inactive
and DFG-in/out/other. For active/inactive prediction, the model is a simple binary
classifier. DFGclassifier, on the other hand, is a stack of three XGBoost models
with a LogisticRegression on top. The latter was trained to give more accurate
predictions for borderline cases, where the conformation is ambiguous. Thus, the output
includes the probabilities of the original XGB models (one each for the in, out, and
other conformations) and the "balanced" meta-classifier probabilities, which, for
obvious cases, will not differ much from the XGB probabilities.
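To make the stacking idea concrete, here is a minimal pure-Python sketch of a meta-classifier that combines three base probabilities into one balanced distribution. It assumes a multinomial (softmax) logistic regression over the three base outputs; the weights and biases below are invented for illustration and are not the fitted parameters of DFGclassifier, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

CLASSES = ("in", "out", "other")

# Hypothetical meta-classifier parameters: one row of weights per output class.
# Real values would come from fitting a logistic regression on base-model outputs.
WEIGHTS: List[List[float]] = [
    [4.0, -2.0, -2.0],
    [-2.0, 4.0, -2.0],
    [-2.0, -2.0, 4.0],
]
BIASES: List[float] = [0.0, 0.0, 0.0]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: shift by the maximum before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def meta_probabilities(base_probs: Sequence[float]) -> Dict[str, float]:
    """Combine the base classifiers' probabilities (in, out, other) into one distribution."""
    if len(base_probs) != len(CLASSES):
        raise ValueError(f"expected {len(CLASSES)} probabilities, got {len(base_probs)}")
    scores = [
        sum(w * p for w, p in zip(row, base_probs)) + b
        for row, b in zip(WEIGHTS, BIASES)
    ]
    return dict(zip(CLASSES, softmax(scores)))


# An obvious DFG-in case and a borderline case where "in" and "out" compete.
print(meta_probabilities([0.98, 0.01, 0.02]))
print(meta_probabilities([0.55, 0.50, 0.10]))
```

Unlike the three independent base probabilities, the meta-classifier output sums to one, which is what makes ambiguous cases directly comparable across the three conformations.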
After installing the package, the kinactive CLI should be available. Execute kinactive
in the terminal to see which commands are available. Currently, there are fetch, db,
and predict commands.
Use kinactive fetch to download already prepared datasets. One can customize which
data to download via options. For instance, running
kinactive fetch --db
will download only the PK data collection, whereas executing
kinactive fetch --all -rvu -o downloads
will fetch all the available data to ./downloads, unpack and remove the raw archives,
and output basic logging information about the progress.
The datasets can also be fetched directly via this link.
Use kinactive predict to apply the ML models to a small number of sequences
or structures. For a more extensive data collection, one should first compile it
separately (see below). Variable descriptors can then be calculated with kinactive or
separately using lXtractor (see this link).
Examples:
This command will run sequence-based models on the SRC kinase sequence:
kinactive predict -t s -o ./seq_predictions P12931
which should output the following:
This command will run structure-based models on two SRC kinase structures:
kinactive predict -t S -o ./pdb_predictions 2OIQ 2SRC
Finally, the following command will run structure-based models on the AlphaFold2-predicted model of the SRC kinase:
kinactive predict -t a -o ./af2_predictions P12931
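The three invocations above can also be scripted from Python, which may be convenient when looping over several inputs. The sketch below only assembles the documented commands and, by default, returns them as strings without running anything. The helper names predict_command and run are hypothetical; the flags and arguments are those shown in the examples above.

```python
import shlex
import subprocess
from typing import List, Sequence


def predict_command(input_type: str, out_dir: str, inputs: Sequence[str]) -> List[str]:
    """Assemble a `kinactive predict` call using the flags shown in the examples."""
    return ["kinactive", "predict", "-t", input_type, "-o", out_dir, *inputs]


def run(cmd: Sequence[str], dry_run: bool = True) -> str:
    """Return the command as a shell string; execute it only when dry_run is False."""
    if not dry_run:
        # Requires the kinactive CLI to be installed and available on PATH.
        subprocess.run(list(cmd), check=True)
    return shlex.join(cmd)


commands = [
    predict_command("s", "./seq_predictions", ["P12931"]),        # sequence-based models
    predict_command("S", "./pdb_predictions", ["2OIQ", "2SRC"]),  # structure-based models, PDB entries
    predict_command("a", "./af2_predictions", ["P12931"]),        # structure-based models, AlphaFold2 model
]

for cmd in commands:
    print(run(cmd))  # dry run: prints the command line without executing it
```

Passing dry_run=False executes the command and raises an exception if the CLI exits with a non-zero status.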
In each of these cases, the collection of chain sequences or chain structures,
the calculated variables necessary for the models, and the predictions will be saved
to the *_predictions directory. Note that these chains can also serve as input
to the models. For instance, to run sequence-based models on domain sequences
extracted from the 2OIQ and 2SRC entries, one could execute:
kinactive predict -t s -o ./str_seq_predictions -d ./pdb_predictions/chains/*/segments/*
Note that in the command above, we supplied the -d flag to signify that the provided
paths point to already extracted domains (stored ChainSequence objects). Also note
that flags can be concatenated, e.g., -dlv would translate to "domains were
extracted; write a log file; verbose".
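Concatenating short flags is a common command-line convention. The sketch below reproduces the behavior with Python's standard argparse module, purely as an illustration: it is not the parser kinactive uses, and the program name is made up. The three flags and their meanings are taken from the -dlv example above.

```python
import argparse

# A stand-in parser with the three boolean flags from the example above.
parser = argparse.ArgumentParser(prog="predict-sketch")
parser.add_argument("-d", action="store_true", help="inputs are already extracted domains")
parser.add_argument("-l", action="store_true", help="write a log file")
parser.add_argument("-v", action="store_true", help="verbose output")

# Short boolean flags may be joined behind a single dash.
bundled = parser.parse_args(["-dlv"])
separate = parser.parse_args(["-d", "-l", "-v"])

# Both spellings set the same three options.
assert vars(bundled) == vars(separate)
print(vars(bundled))
```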
TBD: Using the CLI to compile the data collection is not implemented at the moment.
One can refer to this link for compiling the database from the Python interpreter.
Compiling arbitrary data collections will be handled by a general-purpose customizable
database protocol, which will be made available with lXtractor >=0.2. Stay tuned!
Advanced users may compile and explore the data collection, calculate additional variables, and run ML models from within the Python interpreter. These use cases are described in the kinactive documentation.