Commit 68199e0

deploy: 1fb690c

jszym committed Feb 12, 2024

Showing 38 changed files with 6,050 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 0c43a31e5cf1e710c67910c0b0d7b0a8
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
18 changes: 18 additions & 0 deletions _sources/api.rst.txt
API
===

Data
----

.. autoclass:: intrepppid.data.ppi_oma.IntrepppidDataset
:members:
:special-members: __init__, __getitem__, __len__

.. autoclass:: intrepppid.data.ppi_oma.IntrepppidDataModule
:members:
:special-members: __init__

Network
-------

.. autofunction:: intrepppid.intrepppid_network
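
As a quick orientation, these entry points can be imported directly from the package. The snippet below mirrors the usage shown in the :doc:`User Guide <guide>` (the first argument, ``steps_per_epoch``, is set to ``0`` here because it is unused at inference time):

.. code:: python

    from intrepppid import intrepppid_network

    # Build the network with the manuscript defaults.
    # steps_per_epoch is 0 here because it is not used for inference.
    net = intrepppid_network(0)
    net.eval()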
118 changes: 118 additions & 0 deletions _sources/cli.rst.txt
Command Line Interface
======================

INTREPPPID provides a :abbr:`CLI (Command Line Interface)` which makes it easy to train the model.

Train
-----

To train the INTREPPPID model as it was trained in the manuscript, use the ``train e2e_rnn_triplet`` command:

.. code:: bash

    $ intrepppid train e2e_rnn_triplet DATASET.h5 spm.model 3 100 80 --seed 3927704 --vocab_size 250 --trunc_len 1500 --embedding_size 64 --rnn_num_layers 2 --rnn_dropout_rate 0.3 --variational_dropout false --bi_reduce last --workers 4 --embedding_droprate 0.3 --do_rate 0.3 --log_path logs/e2e_rnn_triplet --beta_classifier 2 --use_projection false --optimizer_type ranger21_xx --lr 1e-2

.. list-table:: INTREPPPID Manuscript Values for ``e2e_rnn_triplet``
:widths: 25 25 25 50
:header-rows: 1

* - Argument/Flag
- Default
- Manuscript Value
- Description
* - ``PPI_DATASET_PATH``
- None
- See Data
- Path to the PPI dataset. Must be in the INTREPPPID HDF5 format.
* - ``SENTENCEPIECE_PATH``
- None
- See Data
- Path to the SentencePiece model.
* - ``C_TYPE``
- None
- ``3``
- Specifies which dataset in the INTREPPPID HDF5 dataset to use by specifying the C-type.
* - ``NUM_EPOCHS``
- None
- ``100``
- Number of epochs to train the model for.
* - ``BATCH_SIZE``
- None
- ``80``
- The number of samples to use in the batch.
* - ``--seed``
- None
- ``8675309`` or ``5353456`` or ``3927704`` depending on the experiment.
- The random seed. If not specified, chosen at random.
* - ``--vocab_size``
- ``250``
- ``250``
- The number of tokens in the SentencePiece vocabulary.
* - ``--trunc_len``
- ``1500``
- ``1500``
- Length at which to truncate sequences.
* - ``--embedding_size``
- ``64``
- ``64``
- The size of embeddings.
* - ``--rnn_num_layers``
- ``2``
- ``2``
- The number of layers in the AWD-LSTM encoder to use.
* - ``--rnn_dropout_rate``
- ``0.3``
- ``0.3``
- The dropconnect rate for the AWD-LSTM encoder.
* - ``--variational_dropout``
- ``false``
- ``false``
- Whether to use variational dropout, as described in the AWD-LSTM manuscript.
* - ``--bi_reduce``
- ``last``
- ``last``
- Method to reduce the two LSTM embeddings for both directions. Must be one of "concat", "max", "mean", "last".
* - ``--workers``
- ``4``
- ``4``
- The number of processes to use for the DataLoader.
* - ``--embedding_droprate``
- ``0.3``
- ``0.3``
- The amount of Embedding Dropout to use (a la AWD-LSTM).
* - ``--do_rate``
- ``0.3``
- ``0.3``
- The amount of dropout to use in the MLP Classifier.
* - ``--log_path``
- ``"./logs/e2e_rnn_triplet"``
- ``"./logs/e2e_rnn_triplet"``
- The path to save logs.
* - ``--encoder_only_steps``
- ``-1`` (No Steps)
- ``-1`` (No Steps)
- The number of steps to only train the encoder and not the classifier.
* - ``--classifier_warm_up``
- ``-1`` (No Steps)
- ``-1`` (No Steps)
- The number of steps to only train the classifier and not the encoder.
* - ``--beta_classifier``
- ``4`` (25% contribution of the classifier loss, 75% contribution of the orthologue loss)
- ``2`` (50% contribution of the classifier loss, 50% contribution of the orthologue loss)
- Adjusts the amount of weight to give the PPI Classification loss, relative to the Orthologue Locality loss. The loss becomes (1/β)×(classifier_loss) + [1-(1/β)]×(orthologue_loss).
* - ``--lr``
- ``1e-2``
- ``1e-2``
- Learning rate to use.
* - ``--use_projection``
- ``false``
- ``false``
- Whether to use a projection network after the encoder.
* - ``--checkpoint_path``
- ``log_path / model_name / "chkpt"``
- ``log_path / model_name / "chkpt"``
- The location where checkpoints are to be saved.
* - ``--optimizer_type``
- ``ranger21``
- ``ranger21_xx``
- The optimizer to use while training. Must be one of ``ranger21``, ``ranger21_xx``, ``adamw``, ``adamw_1cycle``, or ``adamw_cosine``.
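
To make the ``--beta_classifier`` weighting above concrete, here is a minimal sketch of the loss combination described in the table (the function and variable names are illustrative, not INTREPPPID internals):

.. code:: python

    def combined_loss(classifier_loss: float, orthologue_loss: float, beta: float) -> float:
        # Loss = (1/β)·classifier_loss + [1 − (1/β)]·orthologue_loss
        # β = 2 (manuscript value) weighs both losses equally;
        # β = 4 (default) gives the classifier loss a 25% contribution.
        return (1 / beta) * classifier_loss + (1 - 1 / beta) * orthologue_loss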
115 changes: 115 additions & 0 deletions _sources/data.rst.txt
Data
====

Precomputed Datasets
--------------------

You can download precomputed datasets from the sources below:

1. `Zenodo <https://doi.org/10.5281/zenodo.10594149>`_ (DOI: 10.5281/zenodo.10594149)
2. `Internet Archive <https://archive.org/details/intrepppid_datasets.tar>`_

All datasets are made available under the `Creative Commons Attribution-ShareAlike 4.0 International <https://creativecommons.org/licenses/by-sa/4.0/legalcode>`_ license.

Dataset Format
--------------

INTREPPPID requires that datasets be prepared specifically in `HDF5 <https://en.wikipedia.org/wiki/Hierarchical_Data_Format>`_ files.

Each INTREPPPID dataset must have the following hierarchical structure:

.. code::

    intrepppid.h5
    ├── orthologs
    ├── sequences
    ├── splits
    │   ├── test
    │   ├── train
    │   └── val
    └── interactions
        ├── c1
        │   ├── c1_train
        │   ├── c1_val
        │   └── c1_test
        ├── c2
        │   ├── c2_train
        │   ├── c2_val
        │   └── c2_test
        └── c3
            ├── c3_train
            ├── c3_val
            └── c3_test

Only the "c" group under "interactions" that you specify in the train step with the ``--c_type`` flag needs to be present; the other two may be omitted.
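
As an example, you can verify that a file follows this layout with ``h5py``. This is a sketch under the assumption that the file is readable with plain ``h5py``; the file name is illustrative:

.. code:: python

    import h5py

    with h5py.File("intrepppid.h5", "r") as f:
        # Print every group and dataset in the hierarchy.
        f.visit(print)

        # Confirm the C-type you intend to train on is present,
        # e.g. C3 when training with --c_type 3.
        assert "interactions/c3/c3_train" in f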

Here is the schema for the tables:

.. list-table:: ``orthologs`` schema
:widths: 25 25 25 50
:header-rows: 1

* - Field Name
- Type
- Example
- Description
* - ``ortholog_group_id``
- ``Int64``
- ``1048576``
- The `OMA <https://omabrowser.org/oma/home/>`_ Group ID of the protein in the ``protein_id`` column
* - ``protein_id``
- ``String``
- ``M7ZLH0``
- The `UniProt <https://www.uniprot.org/>`_ accession of a protein with OMA Group ID ``ortholog_group_id``

.. list-table:: ``sequences`` schema
:widths: 25 25 25 50
:header-rows: 1

* - Field Name
- Type
- Example
- Description
* - ``name``
- ``String``
- ``Q9NZE8``
- The `UniProt <https://www.uniprot.org/>`_ accession that corresponds to the amino acid sequence in the ``sequence`` column.
* - ``sequence``
- ``String``
- ``MAASAFAGAVRAASGILRPLNI``...
- The amino acid sequence indicated by the ``name`` column.

.. list-table:: Schema for all tables under ``interactions``
:widths: 25 25 25 50
:header-rows: 1

* - Field Name
- Type
- Example
- Description
* - ``protein_id1``
- ``String``
- ``Q9BQB4``
- The `UniProt <https://www.uniprot.org/>`_ accession of the first protein in the interaction pair.
* - ``protein_id2``
- ``String``
- ``Q9NYF0``
- The `UniProt <https://www.uniprot.org/>`_ accession of the second protein in the interaction pair.
* - ``omid_protein_id``
- ``String``
- ``C1MTX6``
- The `UniProt <https://www.uniprot.org/>`_ accession of the anchor protein for the orthologous locality loss.
* - ``omid_id``
- ``Int64``
- ``737336``
- The `OMA <https://omabrowser.org/oma/home/>`_ Group ID of the anchor protein, from which a positive protein can be chosen for the orthologous locality loss.
* - ``label``
- ``Bool``
- ``False``
- Label indicating whether ``protein_id1`` and ``protein_id2`` interact with one another.
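
The tables can then be read as structured arrays. The sketch below assumes the tables are stored as HDF5 compound datasets (as table-writing libraries such as PyTables produce), with string fields stored as bytes:

.. code:: python

    import h5py

    with h5py.File("intrepppid.h5", "r") as f:
        train = f["interactions/c3/c3_train"]

        # Field names follow the schema above.
        row = train[0]
        p1 = row["protein_id1"].decode()
        p2 = row["protein_id2"].decode()
        label = bool(row["label"])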

Everything under the
127 changes: 127 additions & 0 deletions _sources/guide.rst.txt
User Guide
==========

Training
--------

The easiest way to start training INTREPPPID is to use the :doc:`CLI <cli>`.

An example of running the training loop with the values used in the INTREPPPID manuscript is as follows:

.. code:: bash

    $ intrepppid train e2e_rnn_triplet DATASET.h5 spm.model 3 100 80 --seed 3927704 --vocab_size 250 --trunc_len 1500 --embedding_size 64 --rnn_num_layers 2 --rnn_dropout_rate 0.3 --variational_dropout false --bi_reduce last --workers 4 --embedding_droprate 0.3 --do_rate 0.3 --log_path logs/e2e_rnn_triplet --beta_classifier 2 --use_projection false --optimizer_type ranger21_xx --lr 1e-2

Checkpoints will be saved in a folder ``logs/e2e_rnn_triplet/model_name/chkpt`` and can be used for inference.
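
For example, you can enumerate the saved checkpoints like so. This is a sketch: the ``.ckpt`` extension is an assumption (the PyTorch Lightning default), consistent with the ``state_dict`` loading shown in the Inference section below.

.. code:: python

    from pathlib import Path

    # Checkpoints land under logs/e2e_rnn_triplet/<model_name>/chkpt.
    checkpoints = sorted(Path("logs/e2e_rnn_triplet").glob("*/chkpt/*.ckpt"))
    print(checkpoints)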

Inference
---------

The easiest way to infer using INTREPPPID is through the website `https://PPI.bio <https://ppi.bio>`_. However, you may wish to infer locally using INTREPPPID for various reasons, `e.g.` to infer using your own custom checkpoints.

Preparing Data
^^^^^^^^^^^^^^

To infer using INTREPPPID, you'll have to use the :doc:`API <api>`.

The first step is to gather the amino acid sequences for which you want to infer interactions. This can be as simple as defining a list of sequence pairs:

.. code:: python

    sequence_pairs = [
        ("MANQRLS", "MGPLSS"),
        ("MQQNLSS", "MPWNLS"),
    ]

You'll need to encode all the sequences using the same settings that were used during training. Using the same parameters as the precomputed datasets:

.. code:: python

    from intrepppid.data.ppi_oma import IntrepppidDataset
    import sentencepiece as sp

    trunc_len = 1500

    spp = sp.SentencePieceProcessor(model_file=SPM_FILE)

    encoded_sequence_pairs = []

    for p1, p2 in sequence_pairs:
        x1 = IntrepppidDataset.static_encode(trunc_len, spp, p1)
        x2 = IntrepppidDataset.static_encode(trunc_len, spp, p2)
        # Infer interactions here

Alternatively, you may be interested in loading sequences from an INTREPPPID dataset to do testing. You can use the :py:class:`intrepppid.data.ppi_oma.IntrepppidDataModule`.

.. code:: python

    from intrepppid.data.ppi_oma import IntrepppidDataModule

    batch_size = 80

    data_module = IntrepppidDataModule(
        batch_size=batch_size,
        dataset_path=DATASET_PATH,
        c_type=3,
        trunc_len=1500,
        workers=4,
        vocab_size=250,
        model_file=SPM_FILE,
        seed=8675309,
        sos=False,
        eos=False,
        negative_omid=True,
    )

    data_module.setup()

    for batch in data_module.test_dataloader():
        p1_seq, p2_seq, _, _, _, label = batch
        # Infer interactions here

Load the INTREPPPID network
^^^^^^^^^^^^^^^^^^^^^^^^^^^

We must now instantiate the INTREPPPID network and load weights.

If you trained INTREPPPID with the manuscript defaults, you don't need to pass any values to :py:func:`intrepppid.intrepppid_network` other than ``steps_per_epoch``.

.. code:: python

    import torch

    from intrepppid import intrepppid_network

    # steps_per_epoch is 0 here because it is not used for inference
    net = intrepppid_network(0)
    net.eval()

    chkpt = torch.load(CHECKPOINT_PATH)
    net.load_state_dict(chkpt['state_dict'])

Infer Interactions
^^^^^^^^^^^^^^^^^^

Putting everything together, you get:

.. code:: python

    for p1, p2 in sequence_pairs:
        x1 = IntrepppidDataset.static_encode(trunc_len, spp, p1)
        x2 = IntrepppidDataset.static_encode(trunc_len, spp, p2)

        y_hat_logits = net(x1, x2)

        # The forward pass returns logits, so you need to activate with sigmoid
        y_hat = torch.sigmoid(y_hat_logits)

Or, if you are using the INTREPPPID Data Module:

.. code:: python

    for batch in data_module.test_dataloader():
        x1, x2, _, _, _, label = batch

        y_hat_logits = net(x1, x2)

        # The forward pass returns logits, so you need to activate with sigmoid
        y_hat = torch.sigmoid(y_hat_logits)
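
In both cases, ``y_hat`` is an interaction probability between 0 and 1. Thresholding at 0.5 to obtain binary interaction calls is a common convention (an assumption here, not something INTREPPPID prescribes):

.. code:: python

    # Convert probabilities to binary interaction calls.
    predictions = (y_hat > 0.5).long()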