Added AlphaFind

Coda-Research-Group · Jan 30, 2024 · 5fa6d27
1 parent f5568ac
commit 5fa6d27
Showing 249 changed files with 329,228 additions and 0 deletions.
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1,6 @@
+# Set the default behavior for all files
+* text=auto
+
+# Explicitly declare files that will always have Unix-style line endings
+*.sh text eol=lf
+*.py text eol=lf
diff --git a/README.md b/README.md
@@ -0,0 +1,39 @@
+<div align="center">
+  <br>
+  <br>
+  <a href="https://github.com/Coda-Research-Group/AlphaFind"><img src="https://raw.githubusercontent.com/Coda-Research-Group/AlphaFind/main/static/logo.png" alt="AlphaFind" width="220"></a>
+  <br>
+  <br>
+</div>
+
+# AlphaFind: Discover structure similarity across the entire known proteome
+
+**[AlphaFind](https://alphafind.fi.muni.cz)** is a web-based search engine for structure-based search of the entire [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk). It accepts a UniProt ID, PDB ID, or gene symbol as input and returns the most similar proteins found in AlphaFold DB, with an option to run an additional search that extends and refines the results. The search results are grouped by source organism and displayed along with several similarity metrics. 3D visualizations of the structural superposition of the proteins are provided, and text filters can be used to find specific organisms or UniProt IDs. For details about the methodology and usage, please see the [manual](https://github.com/Coda-Research-Group/AlphaFind/wiki/Manual). This website is free and open to all users and there is no login requirement.
+
+Vector embeddings and model weights used in [AlphaFind](https://alphafind.fi.muni.cz) are available at [AlphaFind: Discover structure similarity across the entire known proteome – data and model | Czech national repository](https://data.narodni-repozitar.cz/general/datasets/d35zf-1ja47).
+This project uses [USalign](https://github.com/pylelab/USalign).
+
+## Code Structure
+
+The codebase is divided into three folders:
+- `training` (model training, index building)
+- `api` (backend)
+- `ui` (frontend)
+
+See the `README.md` files in each folder for more details.
+
+## Running locally
+
+Prerequisites:
+- [Docker](https://docs.docker.com/get-docker/)
+
+1. Clone this repository
+2. Run `./run.sh` in your terminal
+3. Open `http://localhost:8081` in your browser
+
+The `training/data/cifs` folder contains a small subset of the AlphaFold DB comprising 109 proteins.
+The full AlphaFold DB is available from the [AlphaFold DB download page](https://alphafold.ebi.ac.uk/download).
+
+## License
+
+MIT license
diff --git a/api/.flake8 b/api/.flake8
@@ -0,0 +1,4 @@
+[flake8]
+max-line-length = 120
+max-complexity = 7
+extend-ignore = E203
diff --git a/api/.gitignore b/api/.gitignore
@@ -0,0 +1,12 @@
+__pycache__/
+.ipynb_checkpoints/
+wandb/
+.idea/
+config_.yaml
+pod_.yaml
+secret.yaml
+kubectl
+*.h5
+data/
+eph/*/
+models/
diff --git a/api/.pre-commit-config.yaml b/api/.pre-commit-config.yaml
@@ -0,0 +1,18 @@
+repos:
+  - repo: https://github.com/psf/black
+    rev: 23.7.0
+    hooks:
+      - id: black
+        args:
+          - --check
+          - --diff
+  - repo: https://github.com/PyCQA/isort
+    rev: 5.12.0
+    hooks:
+      - id: isort
+        args:
+          - --check-only
+  - repo: https://github.com/PyCQA/flake8
+    rev: 6.1.0
+    hooks:
+      - id: flake8
diff --git a/api/README.md b/api/README.md
@@ -0,0 +1,45 @@
+# AlphaFind API
+
+[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)
+
+This project uses [USalign](https://github.com/pylelab/USalign).
+Vector embeddings and model weights used in [AlphaFind](https://alphafind.fi.muni.cz) are available at [AlphaFind: Discover structure similarity across the entire known proteome – data and model | Czech national repository](https://data.narodni-repozitar.cz/general/datasets/d35zf-1ja47).
+
+## Running locally
+
+1. Make the `data` and `models` folders from `alphafind-training` available in the root of this repository, either by copying them or by symlinking them:
+
+```shell
+ln -s ../alphafind-training/models/ models/
+ln -s ../alphafind-training/data/ data/
+```
+
+2. Run the following commands:
+
+```shell
+# Build the server image
+docker build -t alphafind:server -f ./server/Dockerfile .
+
+# Run the server
+docker run -p 8080:8000 \
+    -v ./data:/data \
+    -v ./models:/models \
+    -v ./eph:/eph \
+    alphafind:server
+
+# Note: on Windows you may need to use absolute paths instead of relative paths.
+
+# Example query
+curl 'http://localhost:8080/search?query=A0A0C5PVI1'
+```
+
+## Installing dependencies
+
+```shell
+# Production environment
+pip install -r requirements.txt
+
+# Development environment
+pip install -r requirements-dev.txt
+pre-commit install
+```
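Reviewer note: the example `curl` query in `api/README.md` can also be issued from Python. The sketch below uses only the standard library and assumes the server is published on host port 8080, as in the `docker run` command. The helper names `build_search_url` and `search` are hypothetical, and the body is returned as raw bytes because the response format is not shown in this README.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: host port 8080 is mapped to the server, as in the `docker run` example.
BASE_URL = "http://localhost:8080"


def build_search_url(query: str, base_url: str = BASE_URL) -> str:
    """Build the /search URL for a protein identifier (hypothetical helper)."""
    return f"{base_url}/search?{urlencode({'query': query})}"


def search(query: str, base_url: str = BASE_URL, timeout: float = 30.0) -> bytes:
    """Send the request and return the raw, undecoded response body."""
    with urlopen(build_search_url(query, base_url), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Only prints the URL; call search(...) once the container is running.
    print(build_search_url("A0A0C5PVI1"))
```

Building the URL with `urlencode` keeps the query string correctly escaped if an identifier ever contains characters that are not URL-safe.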
diff --git a/api/USalign b/api/USalign
diff --git a/api/USalign_LICENSE b/api/USalign_LICENSE
@@ -0,0 +1,15 @@
+   US-align: universal structure alignment of monomeric and complex proteins
+   and nucleic acids
+
+   References to cite:
+   (1) Chengxin Zhang, Morgan Shine, Anna Marie Pyle, Yang Zhang
+       (2022) Nat Methods. 19(9), 1109-1115.
+   (2) Chengxin Zhang, Anna Marie Pyle (2022) iScience. 25(10), 105218.
+
+   DISCLAIMER:
+     Permission to use, copy, modify, and distribute this program for 
+     any purpose, with or without fee, is hereby granted, provided that
+     the notices on the head, the reference information, and this
+     copyright notice appear in all copies or substantial portions of 
+     the Software. It is provided "as is" without express or implied 
+     warranty.
diff --git a/api/compute.sh b/api/compute.sh
@@ -0,0 +1,64 @@
+#!/bin/bash
+
+QUERY_ID=$1
+# <protein_id>,<protein_id>,...
+PROTEINS=$2
+# True/False
+CACHE_RESULT=$3
+# How many proteins we're computing scores for
+LIMIT=$4
+# Which protein to start from
+OFFSET=$5
+
+RESULTS_FOLDER='/eph/results'
+
+mkdir -p "${RESULTS_FOLDER}"
+mkdir -p /eph/initiated_computations
+
+# Mark that this process is computing the query
+touch "/eph/initiated_computations/${QUERY_ID}"
+
+# Create temporary folder
+mkdir -p "/eph/partial_scores/${QUERY_ID}"
+
+# Extract database proteins
+IFS=',' read -ra DATASET_IDS <<<"${PROTEINS}"
+
+# Compute scores for a single query protein and a single database protein
+compute_scores() {
+    local QUERY_ID=$1
+    local PROTEIN_ID=$2
+
+    local QUERY_PROTEIN_PATH="/data/cifs/AF-${QUERY_ID}-F1-model_v3.cif"
+    local DATASET_EXTRACTED_PROTEIN_PATH="/data/cifs/AF-${PROTEIN_ID}-F1-model_v3.cif"
+
+    # Keep columns 3, 5, 8, 9 and 11 of the last line of USalign's tabular (-outfmt 2) output
+    /home/alphafind/USalign "${QUERY_PROTEIN_PATH}" "${DATASET_EXTRACTED_PROTEIN_PATH}" -outfmt 2 | tail -n 1 | awk -F ' ' '{print $3,$5,$8,$9,$11}' >"/eph/partial_scores/${QUERY_ID}/${PROTEIN_ID}.txt"
+}
+export -f compute_scores
+
+# Sets the default number of parallel jobs if not specified otherwise
+if [[ -z "${N_PARALLEL_JOBS}" ]]; then
+    N_PARALLEL_JOBS=20
+fi
+
+# Run at most N_PARALLEL_JOBS jobs in parallel
+parallel --jobs "${N_PARALLEL_JOBS}" "compute_scores ${QUERY_ID} {}" ::: "${DATASET_IDS[@]}"
+
+N_DATASET_PROTEINS="${#DATASET_IDS[@]}"
+
+# Merge results
+for ((i = 0; i < N_DATASET_PROTEINS; i++)); do
+    PROTEIN_ID="${DATASET_IDS[i]}"
+
+    if [[ "${CACHE_RESULT}" == "True" ]]; then
+        echo -n "${PROTEIN_ID} " | tee -a "${RESULTS_FOLDER}/${QUERY_ID}-limit=${LIMIT}-offset=${OFFSET}.txt"
+        tee -a "${RESULTS_FOLDER}/${QUERY_ID}-limit=${LIMIT}-offset=${OFFSET}.txt" <"/eph/partial_scores/${QUERY_ID}/${PROTEIN_ID}.txt"
+    else
+        echo -n "${PROTEIN_ID} "
+        cat "/eph/partial_scores/${QUERY_ID}/${PROTEIN_ID}.txt"
+    fi
+done
+
+# Remove temporary files
+rm -r "/eph/partial_scores/${QUERY_ID}"
+rm "/eph/initiated_computations/${QUERY_ID}"
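Reviewer note: `compute.sh` emits one line per database protein, consisting of the protein ID followed by the five values that `awk` selects from the USalign output. A consumer could parse those lines roughly as sketched below. The field names are an assumption based on the column order of USalign's tabular `-outfmt 2` header (TM1, RMSD, IDali, L1, Lali for columns 3, 5, 8, 9, 11) and are not confirmed by this diff. `Score`, `parse_result_line`, and `cache_file_name` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    protein_id: str
    tm_score: float          # column 3, assumed TM1 (normalized by the first structure)
    rmsd: float              # column 5, assumed RMSD
    aligned_identity: float  # column 8, assumed IDali
    query_length: int        # column 9, assumed L1
    aligned_length: int      # column 11, assumed Lali


def cache_file_name(query_id: str, limit: int, offset: int) -> str:
    """Mirror the cache file naming pattern used in compute.sh."""
    return f"{query_id}-limit={limit}-offset={offset}.txt"


def parse_result_line(line: str) -> Score:
    """Parse one merged line: '<protein_id> <v3> <v5> <v8> <v9> <v11>'."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    protein_id, tm, rmsd, identity, query_len, aligned_len = fields
    return Score(
        protein_id,
        float(tm),
        float(rmsd),
        float(identity),
        int(query_len),
        int(aligned_len),
    )
```

Rejecting lines that do not have exactly six fields guards against a partial score file left empty by a failed USalign run, in which case the merged line would contain only the protein ID.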
diff --git a/api/eph/.gitkeep b/api/eph/.gitkeep