Commit 68199e0

deploy: 1fb690c

jszym committed Feb 12, 2024

Showing 38 changed files with 6,050 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 0c43a31e5cf1e710c67910c0b0d7b0a8
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
18 changes: 18 additions & 0 deletions _sources/api.rst.txt
API
===

Data
----

.. autoclass:: intrepppid.data.ppi_oma.IntrepppidDataset
:members:
:special-members: __init__, __getitem__, __len__

.. autoclass:: intrepppid.data.ppi_oma.IntrepppidDataModule
:members:
:special-members: __init__

Network
-------

.. autofunction:: intrepppid.intrepppid_network
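
As a quick orientation, these entry points can be imported directly from the package. The snippet below mirrors the usage shown in the :doc:`User Guide <guide>` (the first argument, ``steps_per_epoch``, is set to ``0`` here because it is unused at inference time):

.. code:: python

    from intrepppid import intrepppid_network

    # Build the network with the manuscript defaults.
    # steps_per_epoch is 0 here because it is not used for inference.
    net = intrepppid_network(0)
    net.eval()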
118 changes: 118 additions & 0 deletions _sources/cli.rst.txt
Command Line Interface
======================

INTREPPPID provides a :abbr:`CLI (Command Line Interface)` which makes it easy to train the model.

Train
-----

To train the INTREPPPID model as it was trained in the manuscript, use the ``train e2e_rnn_triplet`` command:

.. code:: bash

    $ intrepppid train e2e_rnn_triplet DATASET.h5 spm.model 3 100 80 --seed 3927704 --vocab_size 250 --trunc_len 1500 --embedding_size 64 --rnn_num_layers 2 --rnn_dropout_rate 0.3 --variational_dropout false --bi_reduce last --workers 4 --embedding_droprate 0.3 --do_rate 0.3 --log_path logs/e2e_rnn_triplet --beta_classifier 2 --use_projection false --optimizer_type ranger21_xx --lr 1e-2

.. list-table:: INTREPPPID Manuscript Values for ``e2e_rnn_triplet``
:widths: 25 25 25 50
:header-rows: 1

* - Argument/Flag
- Default
- Manuscript Value
- Description
* - ``PPI_DATASET_PATH``
- None
- See Data
- Path to the PPI dataset. Must be in the INTREPPPID HDF5 format.
* - ``SENTENCEPIECE_PATH``
- None
- See Data
- Path to the SentencePiece model.
* - ``C_TYPE``
- None
- ``3``
- Specifies which dataset in the INTREPPPID HDF5 dataset to use by specifying the C-type.
* - ``NUM_EPOCHS``
- None
- ``100``
- Number of epochs to train the model for.
* - ``BATCH_SIZE``
- None
- ``80``
- The number of samples to use in the batch.
* - ``--seed``
- None
- ``8675309`` or ``5353456`` or ``3927704`` depending on the experiment.
- The random seed. If not specified, chosen at random.
* - ``--vocab_size``
- ``250``
- ``250``
- The number of tokens in the SentencePiece vocabulary.
* - ``--trunc_len``
- ``1500``
- ``1500``
- Length at which to truncate sequences.
* - ``--embedding_size``
- ``64``
- ``64``
- The size of embeddings.
* - ``--rnn_num_layers``
- ``2``
- ``2``
- The number of layers in the AWD-LSTM encoder to use.
* - ``--rnn_dropout_rate``
- ``0.3``
- ``0.3``
- The dropconnect rate for the AWD-LSTM encoder.
* - ``--variational_dropout``
- ``false``
- ``false``
- Whether to use variational dropout, as described in the AWD-LSTM manuscript.
* - ``--bi_reduce``
- ``last``
- ``last``
- Method to reduce the two LSTM embeddings for both directions. Must be one of "concat", "max", "mean", "last".
* - ``--workers``
- ``4``
- ``4``
- The number of processes to use for the DataLoader.
* - ``--embedding_droprate``
- ``0.3``
- ``0.3``
- The amount of Embedding Dropout to use (a la AWD-LSTM).
* - ``--do_rate``
- ``0.3``
- ``0.3``
- The amount of dropout to use in the MLP Classifier.
* - ``--log_path``
- ``"./logs/e2e_rnn_triplet"``
- ``"./logs/e2e_rnn_triplet"``
- The path to save logs.
* - ``--encoder_only_steps``
- ``-1`` (No Steps)
- ``-1`` (No Steps)
- The number of steps to only train the encoder and not the classifier.
* - ``--classifier_warm_up``
- ``-1`` (No Steps)
- ``-1`` (No Steps)
- The number of steps to only train the classifier and not the encoder.
* - ``--beta_classifier``
- ``4`` (25% contribution of the classifier loss, 75% contribution of the orthologue loss)
- ``2`` (50% contribution of the classifier loss, 50% contribution of the orthologue loss)
- Adjusts the amount of weight to give the PPI Classification loss, relative to the Orthologue Locality loss. The loss becomes (1/β)×(classifier_loss) + [1-(1/β)]×(orthologue_loss).
* - ``--lr``
- ``1e-2``
- ``1e-2``
- Learning rate to use.
* - ``--use_projection``
- ``false``
- ``false``
- Whether to use a projection network after the encoder.
* - ``--checkpoint_path``
- ``log_path / model_name / "chkpt"``
- ``log_path / model_name / "chkpt"``
- The location where checkpoints are to be saved.
* - ``--optimizer_type``
- ``ranger21``
- ``ranger21_xx``
- The optimizer to use while training. Must be one of ``ranger21``, ``ranger21_xx``, ``adamw``, ``adamw_1cycle``, or ``adamw_cosine``.
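
To make the ``--beta_classifier`` weighting above concrete, here is a minimal sketch of the loss combination described in the table (the function and variable names are illustrative, not INTREPPPID internals):

.. code:: python

    def combined_loss(classifier_loss: float, orthologue_loss: float, beta: float) -> float:
        # Loss = (1/β)·classifier_loss + [1 − (1/β)]·orthologue_loss
        # β = 2 (manuscript value) weighs both losses equally;
        # β = 4 (default) gives the classifier loss a 25% contribution.
        return (1 / beta) * classifier_loss + (1 - 1 / beta) * orthologue_loss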
115 changes: 115 additions & 0 deletions _sources/data.rst.txt
Data
====

Precomputed Datasets
--------------------

You can download precomputed datasets from the sources below:

1. `Zenodo <https://doi.org/10.5281/zenodo.10594149>`_ (DOI: 10.5281/zenodo.10594149)
2. `Internet Archive <https://archive.org/details/intrepppid_datasets.tar>`_

All datasets are made available under the `Creative Commons Attribution-ShareAlike 4.0 International <https://creativecommons.org/licenses/by-sa/4.0/legalcode>`_ license.

Dataset Format
--------------

INTREPPPID requires that datasets be prepared specifically in `HDF5 <https://en.wikipedia.org/wiki/Hierarchical_Data_Format>`_ files.

Each INTREPPPID dataset must have the following hierarchical structure:

.. code::

    intrepppid.h5
    ├── orthologs
    ├── sequences
    ├── splits
    │   ├── test
    │   ├── train
    │   └── val
    └── interactions
        ├── c1
        │   ├── c1_train
        │   ├── c1_val
        │   └── c1_test
        ├── c2
        │   ├── c2_train
        │   ├── c2_val
        │   └── c2_test
        └── c3
            ├── c3_train
            ├── c3_val
            └── c3_test

Only the "c" group under "interactions" that you specify in the train step with the ``--c_type`` flag needs to be present; the other two may be omitted.
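
As an example, you can verify that a file follows this layout with ``h5py``. This is a sketch under the assumption that the file is readable with plain ``h5py``; the file name is illustrative:

.. code:: python

    import h5py

    with h5py.File("intrepppid.h5", "r") as f:
        # Print every group and dataset in the hierarchy.
        f.visit(print)

        # Confirm the C-type you intend to train on is present,
        # e.g. C3 when training with --c_type 3.
        assert "interactions/c3/c3_train" in f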

Here is the schema for the tables:

.. list-table:: ``orthologs`` schema
:widths: 25 25 25 50
:header-rows: 1

* - Field Name
- Type
- Example
- Description
* - ``ortholog_group_id``
- ``Int64``
- ``1048576``
- The `OMA <https://omabrowser.org/oma/home/>`_ Group ID of the protein in the ``protein_id`` column
* - ``protein_id``
- ``String``
- ``M7ZLH0``
- The `UniProt <https://www.uniprot.org/>`_ accession of a protein with OMA Group ID ``ortholog_group_id``

.. list-table:: ``sequences`` schema
:widths: 25 25 25 50
:header-rows: 1

* - Field Name
- Type
- Example
- Description
* - ``name``
- ``String``
- ``Q9NZE8``
- The `UniProt <https://www.uniprot.org/>`_ accession that corresponds to the amino acid sequence in the ``sequence`` column.
* - ``sequence``
- ``String``
- ``MAASAFAGAVRAASGILRPLNI``...
- The amino acid sequence indicated by the ``name`` column.

.. list-table:: Schema for all tables under ``interactions``
:widths: 25 25 25 50
:header-rows: 1

* - Field Name
- Type
- Example
- Description
* - ``protein_id1``
- ``String``
- ``Q9BQB4``
- The `UniProt <https://www.uniprot.org/>`_ accession of the first protein in the interaction pair.
* - ``protein_id2``
- ``String``
- ``Q9NYF0``
- The `UniProt <https://www.uniprot.org/>`_ accession of the second protein in the interaction pair.
* - ``omid_protein_id``
- ``String``
- ``C1MTX6``
- The `UniProt <https://www.uniprot.org/>`_ accession of the anchor protein for the orthologous locality loss.
* - ``omid_id``
- ``Int64``
- ``737336``
- The `OMA <https://omabrowser.org/oma/home/>`_ Group ID of the anchor protein, from which a positive protein can be chosen for the orthologous locality loss.
* - ``label``
- ``Bool``
- ``False``
- Label indicating whether ``protein_id1`` and ``protein_id2`` interact with one another.
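
The tables can then be read as structured arrays. The sketch below assumes the tables are stored as HDF5 compound datasets (as table-writing libraries such as PyTables produce), with string fields stored as bytes:

.. code:: python

    import h5py

    with h5py.File("intrepppid.h5", "r") as f:
        train = f["interactions/c3/c3_train"]

        # Field names follow the schema above.
        row = train[0]
        p1 = row["protein_id1"].decode()
        p2 = row["protein_id2"].decode()
        label = bool(row["label"])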

Everything under the
127 changes: 127 additions & 0 deletions _sources/guide.rst.txt
User Guide
==========

Training
--------

The easiest way to start training INTREPPPID is to use the :doc:`CLI <cli>`.

An example of running the training loop with the values used in the INTREPPPID manuscript is as follows:

.. code:: bash

    $ intrepppid train e2e_rnn_triplet DATASET.h5 spm.model 3 100 80 --seed 3927704 --vocab_size 250 --trunc_len 1500 --embedding_size 64 --rnn_num_layers 2 --rnn_dropout_rate 0.3 --variational_dropout false --bi_reduce last --workers 4 --embedding_droprate 0.3 --do_rate 0.3 --log_path logs/e2e_rnn_triplet --beta_classifier 2 --use_projection false --optimizer_type ranger21_xx --lr 1e-2

Checkpoints will be saved in a folder ``logs/e2e_rnn_triplet/model_name/chkpt`` and can be used for inference.
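
For example, you can enumerate the saved checkpoints like so. This is a sketch: the ``.ckpt`` extension is an assumption (the PyTorch Lightning default), consistent with the ``state_dict`` loading shown in the Inference section below.

.. code:: python

    from pathlib import Path

    # Checkpoints land under logs/e2e_rnn_triplet/<model_name>/chkpt.
    checkpoints = sorted(Path("logs/e2e_rnn_triplet").glob("*/chkpt/*.ckpt"))
    print(checkpoints)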

Inference
---------

The easiest way to infer using INTREPPPID is through the website `https://PPI.bio <https://ppi.bio>`_. However, you may wish to infer locally using INTREPPPID for various reasons, `e.g.` to infer using your own custom checkpoints.

Preparing Data
^^^^^^^^^^^^^^

To infer using INTREPPPID, you'll have to use the :doc:`API <api>`.

The first step is to gather the amino acid sequences for which you want to infer interactions. This can be as simple as defining a list of sequence pairs:

.. code:: python

    sequence_pairs = [
        ("MANQRLS", "MGPLSS"),
        ("MQQNLSS", "MPWNLS"),
    ]

You'll need to encode all the sequences using the same settings that were used during training. Using the same parameters as the precomputed datasets:

.. code:: python

    from intrepppid.data.ppi_oma import IntrepppidDataset
    import sentencepiece as sp

    trunc_len = 1500

    spp = sp.SentencePieceProcessor(model_file=SPM_FILE)

    encoded_sequence_pairs = []

    for p1, p2 in sequence_pairs:
        x1 = IntrepppidDataset.static_encode(trunc_len, spp, p1)
        x2 = IntrepppidDataset.static_encode(trunc_len, spp, p2)
        # Infer interactions here

Alternatively, you may be interested in loading sequences from an INTREPPPID dataset to do testing. You can use the :py:class:`intrepppid.data.ppi_oma.IntrepppidDataModule`.

.. code:: python

    from intrepppid.data.ppi_oma import IntrepppidDataModule

    batch_size = 80

    data_module = IntrepppidDataModule(
        batch_size=batch_size,
        dataset_path=DATASET_PATH,
        c_type=3,
        trunc_len=1500,
        workers=4,
        vocab_size=250,
        model_file=SPM_FILE,
        seed=8675309,
        sos=False,
        eos=False,
        negative_omid=True,
    )

    data_module.setup()

    for batch in data_module.test_dataloader():
        p1_seq, p2_seq, _, _, _, label = batch
        # Infer interactions here

Load the INTREPPPID network
^^^^^^^^^^^^^^^^^^^^^^^^^^^

We must now instantiate the INTREPPPID network and load weights.

If you trained INTREPPPID with the manuscript defaults, you don't need to pass any values to :py:func:`intrepppid.intrepppid_network` other than ``steps_per_epoch``.

.. code:: python

    import torch

    from intrepppid import intrepppid_network

    # steps_per_epoch is 0 here because it is not used for inference
    net = intrepppid_network(0)
    net.eval()

    chkpt = torch.load(CHECKPOINT_PATH)
    net.load_state_dict(chkpt['state_dict'])

Infer Interactions
^^^^^^^^^^^^^^^^^^

Putting everything together, you get:

.. code:: python

    for p1, p2 in sequence_pairs:
        x1 = IntrepppidDataset.static_encode(trunc_len, spp, p1)
        x2 = IntrepppidDataset.static_encode(trunc_len, spp, p2)

        y_hat_logits = net(x1, x2)

        # The forward pass returns logits, so you need to activate with sigmoid
        y_hat = torch.sigmoid(y_hat_logits)

Or, if you are using the INTREPPPID Data Module:

.. code:: python

    for batch in data_module.test_dataloader():
        x1, x2, _, _, _, label = batch

        y_hat_logits = net(x1, x2)

        # The forward pass returns logits, so you need to activate with sigmoid
        y_hat = torch.sigmoid(y_hat_logits)
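
In both cases, ``y_hat`` is an interaction probability between 0 and 1. Thresholding at 0.5 to obtain binary interaction calls is a common convention (an assumption here, not something INTREPPPID prescribes):

.. code:: python

    # Convert probabilities to binary interaction calls.
    predictions = (y_hat > 0.5).long()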