Commit
0 parents
commit 68199e0
Showing 38 changed files with 6,050 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 0c43a31e5cf1e710c67910c0b0d7b0a8 | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
API | ||
=== | ||
|
||
Data | ||
---- | ||
|
||
.. autoclass:: intrepppid.data.ppi_oma.IntrepppidDataset | ||
:members: | ||
:special-members: __init__, __getitem__, __len__ | ||
|
||
.. autoclass:: intrepppid.data.ppi_oma.IntrepppidDataModule | ||
:members: | ||
:special-members: __init__ | ||
|
||
Network | ||
------- | ||
|
||
.. autofunction:: intrepppid.intrepppid_network |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,118 @@ | ||
Command Line Interface | ||
====================== | ||
|
||
INTREPPPID has a :abbr:`CLI (Command Line Interface)` which can be used to easily train INTREPPPID. | ||
|
||
Train | ||
----- | ||
|
||
To train the INTREPPPID model as it was in the manuscript, use the ``train e2e_rnn_triplet`` command: | ||
|
||
.. code:: bash

   $ intrepppid train e2e_rnn_triplet DATASET.h5 spm.model 3 100 80 --seed 3927704 --vocab_size 250 --trunc_len 1500 --embedding_size 64 --rnn_num_layers 2 --rnn_dropout_rate 0.3 --variational_dropout false --bi_reduce last --workers 4 --embedding_droprate 0.3 --do_rate 0.3 --log_path logs/e2e_rnn_triplet --beta_classifier 2 --use_projection false --optimizer_type ranger21_xx --lr 1e-2

.. list-table:: INTREPPPID Manuscript Values for ``e2e_rnn_triplet`` | ||
:widths: 25 25 25 50 | ||
:header-rows: 1 | ||
|
||
* - Argument/Flag | ||
- Default | ||
- Manuscript Value | ||
- Description | ||
* - ``PPI_DATASET_PATH`` | ||
- None | ||
- See Data | ||
- Path to the PPI dataset. Must be in the INTREPPPID HDF5 format. | ||
* - ``SENTENCEPIECE_PATH`` | ||
- None | ||
- See Data | ||
- Path to the SentencePiece model. | ||
* - ``C_TYPE`` | ||
- None | ||
- ``3`` | ||
- Specifies which dataset in the INTREPPPID HDF5 dataset to use by specifying the C-type. | ||
* - ``NUM_EPOCHS`` | ||
- None | ||
- ``100`` | ||
- Number of epochs to train the model for. | ||
* - ``BATCH_SIZE`` | ||
- None | ||
- ``80`` | ||
- The number of samples to use in the batch. | ||
* - ``--seed`` | ||
- None | ||
- ``8675309`` or ``5353456`` or ``3927704`` depending on the experiment. | ||
- The random seed. If not specified, chosen at random. | ||
* - ``--vocab_size`` | ||
- ``250`` | ||
- ``250`` | ||
- The number of tokens in the SentencePiece vocabulary. | ||
* - ``--trunc_len`` | ||
- ``1500`` | ||
- ``1500`` | ||
- Length at which to truncate sequences. | ||
* - ``--embedding_size`` | ||
- ``64`` | ||
- ``64`` | ||
- The size of embeddings. | ||
* - ``--rnn_num_layers`` | ||
- ``2`` | ||
- ``2`` | ||
- The number of layers in the AWD-LSTM encoder to use. | ||
* - ``--rnn_dropout_rate`` | ||
- ``0.3`` | ||
- ``0.3`` | ||
- The dropconnect rate for the AWD-LSTM encoder. | ||
* - ``--variational_dropout`` | ||
- ``false`` | ||
- ``false`` | ||
- Whether to use variational dropout, as described in the AWD-LSTM manuscript. | ||
* - ``--bi_reduce`` | ||
- ``last`` | ||
- ``last`` | ||
- Method to reduce the two LSTM embeddings for both directions. Must be one of "concat", "max", "mean", "last". | ||
* - ``--workers`` | ||
- ``4`` | ||
- ``4`` | ||
- The number of processes to use for the DataLoader. | ||
* - ``--embedding_droprate`` | ||
- ``0.3`` | ||
- ``0.3`` | ||
- The amount of Embedding Dropout to use (a la AWD-LSTM). | ||
* - ``--do_rate`` | ||
- ``0.3`` | ||
- ``0.3`` | ||
- The amount of dropout to use in the MLP Classifier. | ||
* - ``--log_path`` | ||
- ``"./logs/e2e_rnn_triplet"`` | ||
- ``"./logs/e2e_rnn_triplet"`` | ||
- The path to save logs. | ||
* - ``--encoder_only_steps`` | ||
- ``-1`` (No Steps) | ||
- ``-1`` (No Steps) | ||
- The number of steps to only train the encoder and not the classifier. | ||
* - ``--classifier_warm_up`` | ||
- ``-1`` (No Steps) | ||
- ``-1`` (No Steps) | ||
- The number of steps to only train the classifier and not the encoder. | ||
* - ``--beta_classifier`` | ||
- ``4`` (25% contribution of the classifier loss, 75% contribution of the orthologue loss) | ||
- ``2`` (50% contribution of the classifier loss, 50% contribution of the orthologue loss) | ||
- Adjusts the amount of weight to give the PPI Classification loss, relative to the Orthologue Locality loss. The loss becomes (1/β)×(classifier_loss) + [1-(1/β)]×(orthologue_loss). | ||
* - ``--lr`` | ||
- ``1e-2`` | ||
- ``1e-2`` | ||
- Learning rate to use. | ||
* - ``--use_projection`` | ||
- ``false`` | ||
- ``false`` | ||
- Whether to use a projection network after the encoder. | ||
* - ``--checkpoint_path`` | ||
- ``log_path / model_name / "chkpt"`` | ||
- ``log_path / model_name / "chkpt"`` | ||
- The location where checkpoints are to be saved. | ||
* - ``--optimizer_type`` | ||
- ``ranger21`` | ||
- ``ranger21_xx`` | ||
- The optimizer to use while training. Must be one of ``ranger21``, ``ranger21_xx``, ``adamw``, ``adamw_1cycle``, or ``adamw_cosine``. |
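The weighting described for ``--beta_classifier`` can be checked with a few lines of arithmetic. The following is a minimal sketch, not INTREPPPID's own code: the function names ``loss_weights`` and ``combined_loss`` are hypothetical, and the check that β is at least 1 is an assumption made here so that both weights stay between 0 and 1.

```python
from typing import Tuple


def loss_weights(beta_classifier: float) -> Tuple[float, float]:
    """Return (classifier_weight, orthologue_weight) for a given beta.

    Follows the formula from the table:
    loss = (1/beta) * classifier_loss + (1 - 1/beta) * orthologue_loss
    """
    # Assumption: beta >= 1, so that neither weight becomes negative.
    if beta_classifier < 1:
        raise ValueError("beta_classifier is assumed to be >= 1")
    classifier_weight = 1.0 / beta_classifier
    return classifier_weight, 1.0 - classifier_weight


def combined_loss(
    classifier_loss: float, orthologue_loss: float, beta_classifier: float
) -> float:
    """Combine the two losses using the weights implied by beta."""
    w_classifier, w_orthologue = loss_weights(beta_classifier)
    return w_classifier * classifier_loss + w_orthologue * orthologue_loss


print(loss_weights(4))  # default: 25% classifier, 75% orthologue
print(loss_weights(2))  # manuscript: 50% classifier, 50% orthologue
```

With the default of ``4`` the classifier loss contributes a quarter of the total, and with the manuscript value of ``2`` the two losses contribute equally, matching the percentages given in the table.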
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
Data | ||
==== | ||
|
||
Precomputed Datasets | ||
-------------------- | ||
|
||
You can download precomputed datasets from the sources below: | ||
|
||
1. `Zenodo <https://doi.org/10.5281/zenodo.10594149>`_ (DOI: 10.5281/zenodo.10594149) | ||
2. `Internet Archive <https://archive.org/details/intrepppid_datasets.tar>`_ | ||
|
||
All datasets are made available under the `Creative Commons Attribution-ShareAlike 4.0 International <https://creativecommons.org/licenses/by-sa/4.0/legalcode>`_ license. | ||
|
||
Dataset Format | ||
-------------- | ||
|
||
INTREPPPID requires that datasets be prepared specifically in `HDF5 <https://en.wikipedia.org/wiki/Hierarchical_Data_Format>`_ files. | ||
|
||
Each INTREPPPID dataset must have the following hierarchical structure:
|
||
.. code::

   intrepppid.h5
   ├── orthologs
   ├── sequences
   │
   ├── splits
   │   ├── test
   │   ├── train
   │   └── val
   │
   └── interactions
       ├── c1
       │   ├── c1_train
       │   ├── c1_val
       │   └── c1_test
       │
       ├── c2
       │   ├── c2_train
       │   ├── c2_val
       │   └── c2_test
       │
       └── c3
           ├── c3_train
           ├── c3_val
           └── c3_test

Only one of the "c" folders under "interactions" needs to be present, so long as it is the dataset you specify in the train step with the ``C_TYPE`` argument.
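The layout can be turned into a simple checklist of node paths. The sketch below is illustrative only: ``required_nodes`` and ``missing_nodes`` are hypothetical helpers, not part of the INTREPPPID API. It assumes that the tables under each ``cN`` folder are named ``cN_train``, ``cN_val`` and ``cN_test``, and that the ``orthologs``, ``sequences`` and ``splits`` nodes shown in the tree are all expected to be present.

```python
from typing import Iterable, List


def required_nodes(c_type: int) -> List[str]:
    """List the node paths implied by the tree for a single C-type."""
    if c_type not in (1, 2, 3):
        raise ValueError("c_type must be 1, 2, or 3")
    group = f"/interactions/c{c_type}"
    return [
        "/orthologs",
        "/sequences",
        # Assumption: the splits shown in the tree are required.
        "/splits/train",
        "/splits/val",
        "/splits/test",
        # Assumption: tables are named after their parent folder.
        f"{group}/c{c_type}_train",
        f"{group}/c{c_type}_val",
        f"{group}/c{c_type}_test",
    ]


def missing_nodes(present: Iterable[str], c_type: int) -> List[str]:
    """Return the required paths that are absent from ``present``."""
    present_set = set(present)
    return [path for path in required_nodes(c_type) if path not in present_set]


# Example: a file that only contains the top-level tables.
print(missing_nodes(["/orthologs", "/sequences"], c_type=3))
```

The list of paths actually present in a file would come from whichever HDF5 library you use to open it; the helpers above only compare strings.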
|
||
Here is the schema for the tables: | ||
|
||
.. list-table:: ``orthologs`` schema | ||
:widths: 25 25 25 50 | ||
:header-rows: 1 | ||
|
||
* - Field Name | ||
- Type | ||
- Example | ||
- Description | ||
* - ``ortholog_group_id`` | ||
- ``Int64`` | ||
- ``1048576`` | ||
- The `OMA <https://omabrowser.org/oma/home/>`_ Group ID of the protein in the ``protein_id`` column | ||
* - ``protein_id`` | ||
- ``String`` | ||
- ``M7ZLH0`` | ||
- The `UniProt <https://www.uniprot.org/>`_ accession of a protein with OMA Group ID ``ortholog_group_id`` | ||
|
||
.. list-table:: ``sequences`` schema | ||
:widths: 25 25 25 50 | ||
:header-rows: 1 | ||
|
||
* - Field Name | ||
- Type | ||
- Example | ||
- Description | ||
* - ``name`` | ||
- ``String`` | ||
- ``Q9NZE8`` | ||
- The `UniProt <https://www.uniprot.org/>`_ accession that corresponds to the amino acid sequence in the ``sequence`` column. | ||
* - ``sequence`` | ||
- ``String`` | ||
- ``MAASAFAGAVRAASGILRPLNI``... | ||
- The amino acid sequence indicated by the ``name`` column. | ||
|
||
.. list-table:: Schema for all tables under ``interactions`` | ||
:widths: 25 25 25 50 | ||
:header-rows: 1 | ||
|
||
* - Field Name | ||
- Type | ||
- Example | ||
- Description | ||
* - ``protein_id1`` | ||
- ``String`` | ||
- ``Q9BQB4`` | ||
- The `UniProt <https://www.uniprot.org/>`_ accession of the first protein in the interaction pair. | ||
* - ``protein_id2`` | ||
- ``String`` | ||
- ``Q9NYF0`` | ||
- The `UniProt <https://www.uniprot.org/>`_ accession of the second protein in the interaction pair. | ||
* - ``omid_protein_id`` | ||
- ``String`` | ||
- ``C1MTX6`` | ||
- The `UniProt <https://www.uniprot.org/>`_ accession of the anchor protein for the orthologous locality loss. | ||
* - ``omid_id`` | ||
- ``Int64`` | ||
- ``737336`` | ||
- The `OMA <https://omabrowser.org/oma/home/>`_ Group ID of the anchor protein, from which a positive protein can be chosen for the orthologous locality loss.
* - ``label`` | ||
- ``Bool`` | ||
- ``False`` | ||
- Label indicating whether ``protein_id1`` and ``protein_id2`` interact with one another. | ||
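For readers who prefer to see the ``interactions`` schema as a type, here is a minimal sketch that mirrors its fields in a Python dataclass. The class name ``InteractionRow`` is hypothetical and is not part of INTREPPPID; the example values are the ones from the "Example" column and are combined purely for illustration, so they do not represent a real record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionRow:
    """One row of a table under ``interactions`` (illustrative only)."""

    protein_id1: str       # UniProt accession of the first protein
    protein_id2: str       # UniProt accession of the second protein
    omid_protein_id: str   # UniProt accession of the anchor protein
    omid_id: int           # OMA Group ID of the anchor protein
    label: bool            # whether protein_id1 and protein_id2 interact


# Values taken from the "Example" column of the schema table.
row = InteractionRow(
    protein_id1="Q9BQB4",
    protein_id2="Q9NYF0",
    omid_protein_id="C1MTX6",
    omid_id=737336,
    label=False,
)
print(row)
```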
|
||
Everything under the |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,127 @@ | ||
User Guide | ||
========== | ||
|
||
Training | ||
-------- | ||
|
||
The easiest way to start training INTREPPPID is to use the :doc:`CLI <cli>`. | ||
|
||
An example of running the training loop with the values used in the INTREPPPID manuscript is as follows: | ||
|
||
.. code:: bash

   $ intrepppid train e2e_rnn_triplet DATASET.h5 spm.model 3 100 80 --seed 3927704 --vocab_size 250 --trunc_len 1500 --embedding_size 64 --rnn_num_layers 2 --rnn_dropout_rate 0.3 --variational_dropout false --bi_reduce last --workers 4 --embedding_droprate 0.3 --do_rate 0.3 --log_path logs/e2e_rnn_triplet --beta_classifier 2 --use_projection false --optimizer_type ranger21_xx --lr 1e-2

Checkpoints will be saved in a folder ``logs/e2e_rnn_triplet/model_name/chkpt`` and can be used for inference. | ||
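The manuscript used three different seeds depending on the experiment (``8675309``, ``5353456`` and ``3927704``). The following minimal sketch builds one training command per seed; the helper ``train_command`` is hypothetical, and ``DATASET.h5`` and ``spm.model`` are placeholder paths. It passes only the flags whose manuscript values differ from the CLI defaults (``--beta_classifier`` and ``--optimizer_type``), on the assumption that the remaining defaults are as listed in the CLI documentation.

```python
import shlex
from typing import List

# Seeds listed as manuscript values for --seed.
MANUSCRIPT_SEEDS = [8675309, 5353456, 3927704]


def train_command(
    seed: int, dataset: str = "DATASET.h5", spm_model: str = "spm.model"
) -> str:
    """Build the e2e_rnn_triplet training command for one seed."""
    args: List[str] = [
        "intrepppid", "train", "e2e_rnn_triplet",
        dataset, spm_model,
        "3",    # C_TYPE
        "100",  # NUM_EPOCHS
        "80",   # BATCH_SIZE
        "--seed", str(seed),
        "--beta_classifier", "2",
        "--optimizer_type", "ranger21_xx",
    ]
    # shlex.join quotes arguments where the shell would need it.
    return shlex.join(args)


for seed in MANUSCRIPT_SEEDS:
    print(train_command(seed))
```

Each printed line can be pasted into a shell. Pass ``--log_path`` explicitly if you want each seed's checkpoints in a separate folder.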
|
||
Inference | ||
--------- | ||
|
||
The easiest way to infer using INTREPPPID is through the website `https://PPI.bio <https://ppi.bio>`_. However, you may wish to infer locally using INTREPPPID for various reasons, *e.g.*, to infer using your own custom checkpoints.
|
||
Preparing Data | ||
^^^^^^^^^^^^^^ | ||
|
||
To infer using INTREPPPID, you'll have to use the :doc:`API <api>`. | ||
|
||
The first step is to get the amino acid sequences you want to infer. This can be as simple as defining a list of sequence pairs: | ||
|
||
.. code:: python

   sequence_pairs = [
       ("MANQRLS", "MGPLSS"),
       ("MQQNLSS", "MPWNLS"),
   ]

You'll need to encode all of the sequences, using the same settings that were used during training (here, the manuscript values):
|
||
.. code:: python

   import sentencepiece as sp

   from intrepppid.data.ppi_oma import IntrepppidDataset

   # SPM_FILE is the path to the SentencePiece model used during training
   trunc_len = 1500
   spp = sp.SentencePieceProcessor(model_file=SPM_FILE)

   encoded_sequence_pairs = []

   for p1, p2 in sequence_pairs:
       x1 = IntrepppidDataset.static_encode(trunc_len, spp, p1)
       x2 = IntrepppidDataset.static_encode(trunc_len, spp, p2)
       # Keep the encoded pair so it can be passed to the network later
       encoded_sequence_pairs.append((x1, x2))
       # Infer interactions here

Alternatively, you may be interested in loading sequences from an INTREPPPID dataset to do testing. You can use the :py:class:`intrepppid.data.ppi_oma.IntrepppidDataModule`. | ||
|
||
.. code:: python

   from intrepppid.data.ppi_oma import IntrepppidDataModule

   batch_size = 80

   data_module = IntrepppidDataModule(
       batch_size=batch_size,
       dataset_path=DATASET_PATH,
       c_type=3,
       trunc_len=1500,
       workers=4,
       vocab_size=250,
       model_file=SPM_FILE,
       seed=8675309,
       sos=False,
       eos=False,
       negative_omid=True,
   )

   data_module.setup()

   for batch in data_module.test_dataloader():
       p1_seq, p2_seq, _, _, _, label = batch
       # Infer interactions here

Load the INTREPPPID network | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
We must now instantiate the INTREPPPID network and load weights. | ||
|
||
If you trained INTREPPPID with the manuscript defaults, you do not need to pass any values to :py:func:`intrepppid.intrepppid_network` other than ``steps_per_epoch``.
|
||
.. code:: python

   import torch

   from intrepppid import intrepppid_network

   # steps_per_epoch is 0 here because it is not used for inference
   net = intrepppid_network(0)
   net.eval()

   # CHECKPOINT_PATH is the path to a checkpoint saved during training
   chkpt = torch.load(CHECKPOINT_PATH)
   net.load_state_dict(chkpt['state_dict'])

Infer Interactions | ||
^^^^^^^^^^^^^^^^^^ | ||
|
||
Putting everything together, you get: | ||
|
||
.. code:: python

   for p1, p2 in sequence_pairs:
       x1 = IntrepppidDataset.static_encode(trunc_len, spp, p1)
       x2 = IntrepppidDataset.static_encode(trunc_len, spp, p2)
       y_hat_logits = net(x1, x2)
       # The forward pass returns logits, so you need to activate with sigmoid
       y_hat = torch.sigmoid(y_hat_logits)

Or, if you are using the INTREPPPID data module:
|
||
.. code:: python

   for batch in data_module.test_dataloader():
       x1, x2, _, _, _, label = batch
       y_hat_logits = net(x1, x2)
       # The forward pass returns logits, so you need to activate with sigmoid
       y_hat = torch.sigmoid(y_hat_logits)

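Since the forward pass returns logits, it can help to see what the sigmoid step does to a single value and how a probability might be turned into a yes/no call. The sketch below uses only the standard library so it can be run without PyTorch. The helper names are hypothetical, and the ``0.5`` decision threshold is an assumption for illustration, not a value taken from the INTREPPPID manuscript.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a logit to a probability in (0, 1), avoiding overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def predict_interaction(logit: float, threshold: float = 0.5) -> bool:
    """Call a pair interacting when its probability reaches the threshold.

    Assumption: 0.5 is used as the default cut-off purely for illustration.
    """
    return sigmoid(logit) >= threshold


for logit in (-2.0, 0.0, 2.0):
    print(logit, round(sigmoid(logit), 3), predict_interaction(logit))
```

A logit of zero corresponds to a probability of exactly one half, so positive logits lean towards "interacting" and negative logits towards "not interacting". Choose the threshold to suit your own precision and recall requirements.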