Merge pull request #116 from PNNL-CompBio/functionmotifs

PNNL-CompBio · Jun 5, 2024 · c3b8d91
2 parents 155e36c + 64f4fef
commit c3b8d91

Showing 32 changed files with 3,098 additions and 3,264 deletions.
diff --git a/.github/workflows/action.yml b/.github/workflows/action.yml
@@ -52,7 +52,7 @@ jobs:
       - shell: bash -l {0}
         run: mamba install -y -c conda-forge snakemake==7.0 tabulate==0.8.10
       - shell: bash -l {0}
-        run: pip install -e git+https://github.com/PNNL-CompBio/Snekmer@kmer-association#egg=snekmer
+        run: pip install -e git+https://github.com/PNNL-CompBio/Snekmer@functionmotifs#egg=snekmer
 
       #test clustering step
       - name: Snekmer Cluster
@@ -105,3 +105,11 @@ jobs:
           source activate snekmer
           snekmer apply --configfile .test/config_learnapp.yaml -d .test --cores 1
           rm -rf .test/output
+
+      # run Snekmer Motif using previously generated model files
+      - name: Snekmer Motif
+        run: |
+          export PATH="/usr/share/miniconda/bin:$PATH"
+          source activate snekmer
+          snekmer motif --configfile .test/config.yaml -d .test --cores 1
+          rm -rf .test/output
diff --git a/.test/config.yaml b/.test/config.yaml
@@ -48,4 +48,6 @@ score_dir: "output/example-model/"
 learnapp:
   save_apply_associations: False
 
-
+# motif params
+motif:
+  n: 200
diff --git a/README.md b/README.md
@@ -15,7 +15,7 @@ to determine probabilistic annotations.
   <img align="center" src="resources/snekmer_workflow.svg">
 </p>
 
-There are 5 operation modes for Snekmer: `cluster`, `model`, `search`, `learn`, and `apply`.
+There are six operation modes for Snekmer: `cluster`, `model`, `search`, `learn`, `apply`, and `motif`.
 
 **Cluster mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA).
 Snekmer applies the relevant workflow steps and outputs the resulting clustering results in tabular form (.CSV),
@@ -40,6 +40,10 @@ and the outputs received from Learn. Snekmer uses cosine distance to predict the
 sequence from the kmer counts matrix. The output is a table for each file containing sequence annotation
 predictions with confidence levels.
 
+**Motif mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA)
+and the outputs received from Model. Snekmer performs a feature selection workflow to produce a 
+list of motifs ordered by degree of conservation and a classification model using the selected features (.model).
+
 ## How to Use Snekmer
 
 For installation instructions, documentation, and more, refer to

diff --git a/docs/source/getting_started/cli.rst b/docs/source/getting_started/cli.rst
@@ -15,7 +15,7 @@ For an overview of Snekmer usage, reference the help command (``snekmer --help``
 .. code-block:: console
 
     $ snekmer --help
-    usage: snekmer [-h] [-v] {cluster,model,search,learn,apply} ...
+    usage: snekmer [-h] [-v] {cluster,model,search,learn,apply,motif} ...
 
     Snekmer: A tool for kmer-based sequence analysis using amino acid reduction (AAR)
 
@@ -26,7 +26,7 @@ For an overview of Snekmer usage, reference the help command (``snekmer --help``
     mode:
     Snekmer mode
 
-    {cluster,model,search,learn,apply}
+    {cluster,model,search,learn,apply,motif}
 
 Tailored references for the individual operation modes can be accessed
 via ``snekmer {mode} --help``.
@@ -49,7 +49,7 @@ files. Snekmer also assumes background files, if any, are stored in
 is shown below:
 
 
-Snekmer ``cluster``, ``model``, and ``search`` input
+Snekmer ``cluster``, ``model``, ``search``, and ``motif`` input
 
 .. code-block:: console
 

diff --git a/docs/source/getting_started/config.rst b/docs/source/getting_started/config.rst
@@ -131,3 +131,13 @@ General parameters related to Snekmer's learn and apply mode (``snekmer learn``,
  ``seed``                        ``int``               Choose any (random) seed for reproducible fragmentation.
 =============================  =====================  =========================================================================
 
+
+Motif Parameters
+````````````````
+The following parameters are required for Snekmer's motif mode (``snekmer motif``), wherein feature selection is performed to find functionally relevant kmers.
+
+========================  =====================  ==================================================================================
+     Parameter                    Type            Description
+========================  =====================  ==================================================================================
+``n``                     ``int``                Number of label permutation and rescoring iterations to run for each input family.
+========================  =====================  ==================================================================================
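
Note on ``n``: the parameter controls how many times labels are permuted and kmers are rescored for each family. A minimal sketch of that idea follows, assuming a simple stand-in statistic (difference in mean kmer count inside versus outside the family) rather than the SVM weights Snekmer reports. The function name ``permutation_p_values``, the one-sided comparison, and the add-one correction are illustrative assumptions, not Snekmer's implementation.

```python
import random


def permutation_p_values(rows, labels, n=200, seed=0):
    """Estimate one empirical p-value per kmer column.

    rows   : list of per-sequence kmer count vectors (equal length)
    labels : list of 0/1 flags, 1 = sequence belongs to the family
    n      : number of label permutation and rescoring iterations
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])

    def score(y):
        # Stand-in statistic (assumption): mean count in the family
        # minus mean count outside it, computed per kmer column.
        pos = [r for r, flag in zip(rows, y) if flag == 1]
        neg = [r for r, flag in zip(rows, y) if flag == 0]
        return [
            sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
            for j in range(n_cols)
        ]

    observed = score(labels)
    exceed = [0] * n_cols
    shuffled = list(labels)
    for _ in range(n):
        rng.shuffle(shuffled)  # break the label/sequence association
        permuted = score(shuffled)
        for j in range(n_cols):
            if permuted[j] >= observed[j]:
                exceed[j] += 1

    # Add-one correction so that no p-value is reported as exactly zero.
    return [(count + 1) / (n + 1) for count in exceed]
```

With ``n: 200`` as in ``.test/config.yaml``, the smallest p-value this sketch can report is 1/201, so a larger ``n`` gives finer resolution at the cost of more rescoring passes.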
diff --git a/docs/source/getting_started/usage.rst b/docs/source/getting_started/usage.rst
@@ -1,9 +1,9 @@
 Using Snekmer
 =============
 
-Snekmer has three modeling operations: ``cluster`` (unsupervised clustering),
-``model`` (supervised modeling), and ``search`` (application
-of model to new sequences). We will call the first two modes
+Snekmer has four modeling operations: ``cluster`` (unsupervised clustering),
+``model`` (supervised modeling), ``search`` (application
+of model to new sequences), and ``motif`` (feature selection). We will call the first two modes
 **learning modes** due to their utility in learning relationships
 between protein family input files. Users may choose a mode to best
 suit their specific use case.
@@ -233,3 +233,32 @@ and directories in addition to the files described previously.
     │   │   ├── Seq-Annotation-Scores-D.csv  # (optional) Sequence-annotation cosine similarity scores for D seqs
     │   │   ├── kmer-summary-C.csv  # Results with annotation predictions and confidence for C seqs 
     │   │   └── kmer-summary-D.csv  # Results with annotation predictions and confidence for D seqs 
+
+Snekmer Motif Output Files
+::::::::::::::::::::::::::
+
+Snekmer's motif mode produces the following output files and directories in addition to the files described previously.
+
+.. code-block:: console
+
+    .
+    ├── output/
+    │   ├── ...
+    │   ├── motif/
+    │   │   ├── kmers/
+    │   │   │   ├── A.csv  # kmers retained for A after recursive feature elimination
+    │   │   │   └── B.csv  # kmers retained for B after recursive feature elimination
+    │   │   ├── preselection/
+    │   │   │   ├── A.csv  # kmer weights learned for A after recursive feature elimination
+    │   │   │   ├── B.csv  # kmer weights learned for B after recursive feature elimination
+    │   │   │   ├── A.model  # last (A/not A) classification model trained during RFE
+    │   │   │   └── B.model  # last (B/not B) classification model trained during RFE
+    │   │   ├── sequences/
+    │   │   │   ├── A.csv  # Sequence vectors for A using the kmer subset retained after recursive feature elimination
+    │   │   │   └── B.csv  # Sequence vectors for B using the kmer subset retained after recursive feature elimination
+    │   │   ├── scores/
+    │   │   │   ├── A.csv  # kmer weight learned for A on each permute/rescore iteration
+    │   │   │   └── B.csv  # kmer weight learned for B on each permute/rescore iteration
+    │   │   ├── p_values/
+    │   │   │   ├── A.csv  # Tabulated results for A
+    │   │   │   └── B.csv  # Tabulated results for B
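
Note on the motif output layout: a small helper like the one below could be used to check which per-family CSV files a run produced. This is a sketch that assumes only the directory names shown in the tree above. The helper name ``collect_motif_outputs`` is hypothetical and not part of Snekmer, and no assumption is made about the columns inside each file.

```python
from pathlib import Path

# Subdirectories of output/motif/ as listed in the documented tree.
MOTIF_SUBDIRS = ("kmers", "preselection", "sequences", "scores", "p_values")


def collect_motif_outputs(output_dir, family):
    """Map each motif subdirectory to the family's CSV path, or None if absent."""
    motif_dir = Path(output_dir) / "motif"
    found = {}
    for sub in MOTIF_SUBDIRS:
        path = motif_dir / sub / f"{family}.csv"
        found[sub] = path if path.is_file() else None
    return found
```

For example, ``collect_motif_outputs("output", "A")["p_values"]`` would point at the tabulated results for family A if the run completed, and be ``None`` otherwise.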
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -19,7 +19,7 @@ sequences to predict the nearest annotation and generate a confidence score.
         :width: 700
         :alt: Snekmer workflow overview
 
-There are 5 operation modes for Snekmer: ``cluster``, ``model``, ``search``, ``learn``, and ``apply``.
+There are 6 operation modes for Snekmer: ``cluster``, ``model``, ``search``, ``motif``, ``learn``, and ``apply``.
 
 **Cluster mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA).
 Snekmer applies the relevant workflow steps and outputs the resulting clustering results in tabular form (.CSV),
@@ -34,6 +34,8 @@ displays K-fold cross validation results in the form of figures (AUC ROC and PR
 and the models they wish to search their sequences against. Snekmer applies the relevant workflow steps
 and outputs a table for each file containing model annotation probabilities for the given sequences.
 
+**Motif mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA). Snekmer applies the relevant workflow steps and outputs a table (.csv) for each family, which shows the SVM weight and associated p-value for each kmer.
+
 
 **Learn mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA) as well as an annotation file. Snekmer generates a kmer counts matrix with the summed kmer distribution of each annotation recognized from the sequence ID. Snekmer then performs a self-evaluation to assess confidence levels. There are two outputs, a counts matrix, and a global confidence distribution. 
 
@@ -61,6 +63,8 @@ The output is a table for each file containing sequence annotation predictions w
 
    tutorial/index
    tutorial/snekmer_demo
+   tutorial/snekmer_learnapp_tutorial
+   tutorial/snekmer_motif_tutorial
 
 .. toctree::
    :caption: Background

diff --git a/docs/source/tutorial/snekmer_demo.ipynb b/docs/source/tutorial/snekmer_demo.ipynb