Merge pull request #116 from PNNL-CompBio/functionmotifs
tnitka authored Jun 5, 2024
2 parents 155e36c + 64f4fef commit c3b8d91
Showing 32 changed files with 3,098 additions and 3,264 deletions.
10 changes: 9 additions & 1 deletion .github/workflows/action.yml
@@ -52,7 +52,7 @@ jobs:
- shell: bash -l {0}
run: mamba install -y -c conda-forge snakemake==7.0 tabulate==0.8.10
- shell: bash -l {0}
-run: pip install -e git+https://github.com/PNNL-CompBio/Snekmer@kmer-association#egg=snekmer
+run: pip install -e git+https://github.com/PNNL-CompBio/Snekmer@functionmotifs#egg=snekmer

#test clustering step
- name: Snekmer Cluster
@@ -105,3 +105,11 @@ jobs:
source activate snekmer
snekmer apply --configfile .test/config_learnapp.yaml -d .test --cores 1
rm -rf .test/output
# run Snekmer Motif using previously generated model files
- name: Snekmer Motif
run: |
export PATH="/usr/share/miniconda/bin:$PATH"
source activate snekmer
snekmer motif --configfile .test/config.yaml -d .test --cores 1
rm -rf .test/output
4 changes: 3 additions & 1 deletion .test/config.yaml
@@ -48,4 +48,6 @@ score_dir: "output/example-model/"
learnapp:
save_apply_associations: False


# motif params
motif:
n: 200
6 changes: 5 additions & 1 deletion README.md
@@ -15,7 +15,7 @@ to determine probabilistic annotations.
<img align="center" src="resources/snekmer_workflow.svg">
</p>

-There are 5 operation modes for Snekmer: `cluster`, `model`, `search`, `learn`, and `apply`.
+There are six operation modes for Snekmer: `cluster`, `model`, `search`, `learn`, `apply`, and `motif`.

**Cluster mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA).
Snekmer applies the relevant workflow steps and outputs the resulting clustering results in tabular form (.CSV),
@@ -40,6 +40,10 @@ and the outputs received from Learn. Snekmer uses cosine distance to predict the
sequence from the kmer counts matrix. The output is a table for each file containing sequence annotation
predictions with confidence levels.

**Motif mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA)
and the outputs received from Model. Snekmer performs a feature selection workflow to produce a
list of motifs ordered by degree of conservation and a classification model using the selected features (.model).

## How to Use Snekmer

For installation instructions, documentation, and more, refer to
6 changes: 3 additions & 3 deletions docs/source/getting_started/cli.rst
@@ -15,7 +15,7 @@ For an overview of Snekmer usage, reference the help command (``snekmer --help``
.. code-block:: console
$ snekmer --help
-usage: snekmer [-h] [-v] {cluster,model,search,learn,apply} ...
+usage: snekmer [-h] [-v] {cluster,model,search,learn,apply,motif} ...
Snekmer: A tool for kmer-based sequence analysis using amino acid reduction (AAR)
@@ -26,7 +26,7 @@ For an overview of Snekmer usage, reference the help command (``snekmer --help``
mode:
Snekmer mode
-{cluster,model,search,learn,apply}
+{cluster,model,search,learn,apply,motif}
Tailored references for the individual operation modes can be accessed
via ``snekmer {mode} --help``.
@@ -49,7 +49,7 @@ files. Snekmer also assumes background files, if any, are stored in
is shown below:


-Snekmer ``cluster``, ``model``, and ``search`` input
+Snekmer ``cluster``, ``model``, ``search``, and ``motif`` input

.. code-block:: console
10 changes: 10 additions & 0 deletions docs/source/getting_started/config.rst
@@ -131,3 +131,13 @@ General parameters related to Snekmer's learn and apply mode (``snekmer learn``,
``seed`` ``int`` Choose any (random) seed for reproducible fragmentation.
============================= ===================== =========================================================================


Motif Parameters
````````````````
The following parameters are required for Snekmer's motif mode (``snekmer motif``), wherein feature selection is performed to find functionally relevant kmers.

========================  =====================  ==================================================================================
Parameter                 Type                   Description
========================  =====================  ==================================================================================
``n``                     ``int``                Number of label permutation and rescoring iterations to run for each input family.
========================  =====================  ==================================================================================
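
The role of ``n`` can be illustrated with a generic permutation test. The sketch below is not Snekmer's implementation: the helper names ``score`` and ``permutation_p_value``, the toy counts, and the add-one correction are all assumptions made for illustration, and a simple difference of means stands in for the model weight that Snekmer rescores on each iteration.

```python
import random
from statistics import mean


def score(values, labels):
    """Stand-in score (assumption): mean kmer count inside the family minus outside."""
    inside = [v for v, y in zip(values, labels) if y == 1]
    outside = [v for v, y in zip(values, labels) if y == 0]
    return mean(inside) - mean(outside)


def permutation_p_value(values, labels, n=200, seed=0):
    """Empirical p-value from n label permutations, with an add-one correction."""
    rng = random.Random(seed)
    observed = score(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n):
        # Shuffling keeps the class sizes fixed but breaks any link to the counts.
        rng.shuffle(shuffled)
        if score(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n + 1)


# Toy data (hypothetical): counts of one kmer in 6 family and 6 non-family sequences.
counts = [5, 6, 7, 5, 6, 8, 0, 1, 0, 2, 1, 0]
labels = [1] * 6 + [0] * 6
p = permutation_p_value(counts, labels, n=200)
```

With this estimator the smallest attainable p-value is ``1 / (n + 1)``, so a larger ``n`` gives finer resolution at the cost of more rescoring iterations.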
35 changes: 32 additions & 3 deletions docs/source/getting_started/usage.rst
@@ -1,9 +1,9 @@
Using Snekmer
=============

-Snekmer has three modeling operations: ``cluster`` (unsupervised clustering),
-``model`` (supervised modeling), and ``search`` (application
-of model to new sequences). We will call the first two modes
+Snekmer has four modeling operations: ``cluster`` (unsupervised clustering),
+``model`` (supervised modeling), ``search`` (application
+of model to new sequences), and ``motif`` (feature selection). We will call the first two modes
**learning modes** due to their utility in learning relationships
between protein family input files. Users may choose a mode to best
suit their specific use case.
@@ -233,3 +233,32 @@ and directories in addition to the files described previously.
│ │ ├── Seq-Annotation-Scores-D.csv # (optional) Sequence-annotation cosine similarity scores for D seqs
│ │ ├── kmer-summary-C.csv # Results with annotation predictions and confidence for C seqs
│ │ └── kmer-summary-D.csv # Results with annotation predictions and confidence for D seqs
Snekmer Motif Output Files
::::::::::::::::::::::::::

Snekmer's motif mode produces the following output files and directories in addition to the files described previously.

.. code-block:: console
.
├── output/
│ ├── ...
│ ├── motif/
│ │ ├── kmers/
│ │ │ ├── A.csv # kmers retained for A after recursive feature elimination
│ │ │ ├── B.csv # kmers retained for B after recursive feature elimination
│ │ ├── preselection/
│ │ │ ├── A.csv # kmer weights learned for A after recursive feature elimination
│ │ │ ├── B.csv # kmer weights learned for B after recursive feature elimination
│ │ │ ├── A.model # last (A/not A) classification model trained during RFE
│ │ │ ├── B.model # last (B/not B) classification model trained during RFE
│ │ ├── sequences/
│ │ │ ├── A.csv # Sequence vectors for A using the kmer subset retained after recursive feature elimination
│ │ │ ├── B.csv # Sequence vectors for B using the kmer subset retained after recursive feature elimination
│ │ ├── scores/
│ │ │ ├── A.csv # kmer weight learned for A on each permute/rescore iteration
│ │ │ ├── B.csv # kmer weight learned for B on each permute/rescore iteration
│ │ ├── p_values/
│ │ │ ├── A.csv # Tabulated results for A
│ │ │ └── B.csv # Tabulated results for B
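
The layout above can be checked after a run with a small helper. This is a minimal sketch that assumes only the directory names shown in the tree; ``motif_outputs`` and ``MOTIF_SUBDIRS`` are hypothetical names and are not part of Snekmer.

```python
from pathlib import Path

# Subdirectories of output/motif/ as listed in the tree above.
MOTIF_SUBDIRS = ("kmers", "preselection", "sequences", "scores", "p_values")


def motif_outputs(family, root="output"):
    """Map each motif subdirectory to the expected CSV path for one family."""
    base = Path(root) / "motif"
    paths = {sub: base / sub / f"{family}.csv" for sub in MOTIF_SUBDIRS}
    # preselection/ also holds the last classification model trained during RFE.
    paths["model"] = base / "preselection" / f"{family}.model"
    return paths


paths = motif_outputs("A")
# Collect any expected files that are not on disk yet.
missing = [str(p) for p in paths.values() if not p.exists()]
```

Before ``snekmer motif`` has been run, every path is reported in ``missing``; after a successful run for family A the list should be empty.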
6 changes: 5 additions & 1 deletion docs/source/index.rst
@@ -19,7 +19,7 @@ sequences to predict the nearest annotation and generate a confidence score.
:width: 700
:alt: Snekmer workflow overview

-There are 5 operation modes for Snekmer: ``cluster``, ``model``, ``search``, ``learn``, and ``apply``.
+There are 6 operation modes for Snekmer: ``cluster``, ``model``, ``search``, ``motif``, ``learn``, and ``apply``.

**Cluster mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA).
Snekmer applies the relevant workflow steps and outputs the resulting clustering results in tabular form (.CSV),
@@ -34,6 +34,8 @@ displays K-fold cross validation results in the form of figures (AUC ROC and PR
and the models they wish to search their sequences against. Snekmer applies the relevant workflow steps
and outputs a table for each file containing model annotation probabilities for the given sequences.

**Motif mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA). Snekmer applies the relevant workflow steps and outputs a table (.csv) for each family, which shows the SVM weight and associated p-value for each kmer.


**Learn mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA) as well as an annotation file. Snekmer generates a kmer counts matrix with the summed kmer distribution of each annotation recognized from the sequence ID. Snekmer then performs a self-evaluation to assess confidence levels. There are two outputs, a counts matrix, and a global confidence distribution.

@@ -61,6 +63,8 @@ The output is a table for each file containing sequence annotation predictions w

tutorial/index
tutorial/snekmer_demo
tutorial/snekmer_learnapp_tutorial
tutorial/snekmer_motif_tutorial

.. toctree::
:caption: Background
10 changes: 5 additions & 5 deletions docs/source/tutorial/snekmer_demo.ipynb

Large diffs are not rendered by default.
