1 parent f5568ac, commit 5fa6d27
Showing 249 changed files with 329,228 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
# Set the default behavior for all files | ||
* text=auto | ||
|
||
# Explicitly declare files that will always have Unix-style line endings | ||
*.sh text eol=lf | ||
*.py text eol=lf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
<div align="center"> | ||
<br> | ||
<br> | ||
<a href="https://github.com/Coda-Research-Group/AlphaFind"><img src="https://raw.githubusercontent.com/Coda-Research-Group/AlphaFind/main/static/logo.png" alt="AlphaFind" width="220"></a> | ||
<br> | ||
<br> | ||
</div> | ||
|
||
# AlphaFind: Discover structure similarity across the entire known proteome | ||
|
||
**[AlphaFind](https://alphafind.fi.muni.cz)** is a web-based search engine for structure-based search of the entire [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk). It accepts a UniProt ID, PDB ID, or gene symbol as input and returns the most similar proteins found in AlphaFold DB, with an option to run an additional search that extends and refines the results. The search results are grouped by source organism and displayed along with several similarity metrics. 3D visualizations of the structural superposition of the proteins are provided, and text filters can be used to find specific organisms or UniProt IDs. For details about the methodology and usage, please see the [manual](https://github.com/Coda-Research-Group/AlphaFind/wiki/Manual). The website is free and open to all users, with no login requirement. | ||
|
||
Vector embeddings and model weights used in [AlphaFind](https://alphafind.fi.muni.cz) are available at [AlphaFind: Discover structure similarity across the entire known proteome – data and model | Czech national repository](https://data.narodni-repozitar.cz/general/datasets/d35zf-1ja47). | ||
This project uses [USalign](https://github.com/pylelab/USalign). | ||
|
||
## Code Structure | ||
|
||
The codebase is divided into three folders: | ||
- `training` (model training, index building) | ||
- `api` (backend) | ||
- `ui` (frontend) | ||
|
||
See the `README.md` files in each folder for more details. | ||
|
||
## Running locally | ||
|
||
Prerequisites: | ||
- [Docker](https://docs.docker.com/get-docker/) | ||
|
||
1. Clone this repository | ||
2. Run `./run.sh` in your terminal | ||
3. Open `http://localhost:8081` in your browser | ||
|
||
The `training/data/cifs` folder contains a small subset of the AlphaFold DB comprising 109 proteins. | ||
The full AlphaFold DB can be downloaded from [here](https://alphafold.ebi.ac.uk/download). | ||
|
||
## License | ||
|
||
MIT license |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
[flake8] | ||
max-line-length = 120 | ||
max-complexity = 7 | ||
extend-ignore = E203 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
__pycache__/ | ||
.ipynb_checkpoints/ | ||
wandb/ | ||
.idea/ | ||
config_.yaml | ||
pod_.yaml | ||
secret.yaml | ||
kubectl | ||
*.h5 | ||
data/ | ||
eph/*/ | ||
models/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
repos: | ||
- repo: https://github.com/psf/black | ||
rev: 23.7.0 | ||
hooks: | ||
- id: black | ||
args: | ||
- --check | ||
- --diff | ||
- repo: https://github.com/PyCQA/isort | ||
rev: 5.12.0 | ||
hooks: | ||
- id: isort | ||
args: | ||
- --check-only | ||
- repo: https://github.com/PyCQA/flake8 | ||
rev: 6.1.0 | ||
hooks: | ||
- id: flake8 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
# AlphaFind API | ||
|
||
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit) | ||
|
||
This project uses [USalign](https://github.com/pylelab/USalign). | ||
Vector embeddings and model weights used in [AlphaFind](https://alphafind.fi.muni.cz) are available at [AlphaFind: Discover structure similarity across the entire known proteome – data and model | Czech national repository](https://data.narodni-repozitar.cz/general/datasets/d35zf-1ja47). | ||
|
||
## Running locally | ||
|
||
1. Copy or symlink the `data` and `models` folders from `alphafind-training` into the root of this repository, for example: | ||
|
||
```shell | ||
ln -s ../alphafind-training/models models | ||
ln -s ../alphafind-training/data data | ||
``` | ||
|
||
2. Run the following commands: | ||
|
||
```shell | ||
# Build the server image | ||
docker build -t alphafind:server -f ./server/Dockerfile . | ||
|
||
# Run the server | ||
docker run -p 8080:8000 \ | ||
-v ./data:/data \ | ||
-v ./models:/models \ | ||
-v ./eph:/eph \ | ||
alphafind:server | ||
|
||
# Note: on Windows you may need to use absolute paths instead of relative paths in the -v options. | ||
|
||
# Example query | ||
curl 'http://localhost:8080/search?query=A0A0C5PVI1' | ||
``` | ||
|
||
## Installing dependencies | ||
|
||
```shell | ||
# Production environment | ||
pip install -r requirements.txt | ||
|
||
# Development environment | ||
pip install -r requirements-dev.txt | ||
pre-commit install | ||
``` |
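
The example query above can be wrapped in a small helper when several IDs need to be looked up. This is a minimal sketch, not part of the repository: `ALPHAFIND_API`, `search_url`, and `search` are hypothetical names introduced here, and the only endpoint assumed is the `/search?query=<id>` URL shown in the example query. The response format is not documented in this README, so the sketch does not parse it.

```shell
#!/bin/sh

# Base URL of the locally running server. ALPHAFIND_API is a hypothetical
# variable introduced for this sketch; the default matches the port mapping
# (-p 8080:8000) used in the docker run command.
ALPHAFIND_API="${ALPHAFIND_API:-http://localhost:8080}"

# Build the search URL for a single query ID (hypothetical helper).
search_url() {
    printf '%s/search?query=%s\n' "${ALPHAFIND_API}" "$1"
}

# Send the query (hypothetical helper). --fail makes curl exit non-zero on
# HTTP errors; --silent with --show-error hides progress but keeps errors.
search() {
    curl --fail --silent --show-error "$(search_url "$1")"
}
```

With the server container running, `search A0A0C5PVI1` sends the same request as the `curl` example, and setting `ALPHAFIND_API` points the helper at a different host or port.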
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
US-align: universal structure alignment of monomeric and complex proteins | ||
and nucleic acids | ||
|
||
References to cite: | ||
(1) Chengxin Zhang, Morgan Shine, Anna Marie Pyle, Yang Zhang | ||
(2022) Nat Methods. 19(9), 1109-1115. | ||
(2) Chengxin Zhang, Anna Marie Pyle (2022) iScience. 25(10), 105218. | ||
|
||
DISCLAIMER: | ||
Permission to use, copy, modify, and distribute this program for | ||
any purpose, with or without fee, is hereby granted, provided that | ||
the notices on the head, the reference information, and this | ||
copyright notice appear in all copies or substantial portions of | ||
the Software. It is provided "as is" without express or implied | ||
warranty. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
#!/bin/bash | ||
|
||
QUERY_ID=$1 | ||
# <protein_id>,<protein_id>,... | ||
PROTEINS=$2 | ||
# True/False | ||
CACHE_RESULT=$3 | ||
# How many proteins we're computing scores for | ||
LIMIT=$4 | ||
# Which protein to start from | ||
OFFSET=$5 | ||
|
||
RESULTS_FOLDER='/eph/results' | ||
|
||
mkdir -p "${RESULTS_FOLDER}" | ||
mkdir -p /eph/initiated_computations | ||
|
||
# Mark that this process is computing the query | ||
touch "/eph/initiated_computations/${QUERY_ID}" | ||
|
||
# Create temporary folder | ||
mkdir -p "/eph/partial_scores/${QUERY_ID}" | ||
|
||
# Extract database proteins | ||
IFS=',' read -ra DATASET_IDS <<<"${PROTEINS}" | ||
|
||
# Compute scores for a single query protein and a single database protein | ||
compute_scores() { | ||
QUERY_ID=$1 | ||
PROTEIN_ID=$2 | ||
|
||
QUERY_PROTEIN_PATH="/data/cifs/AF-${QUERY_ID}-F1-model_v3.cif" | ||
DATASET_EXTRACTED_PROTEIN_PATH="/data/cifs/AF-${PROTEIN_ID}-F1-model_v3.cif" | ||
|
||
/home/alphafind/USalign "${QUERY_PROTEIN_PATH}" "${DATASET_EXTRACTED_PROTEIN_PATH}" -outfmt 2 | tail -n 1 | awk -F ' ' '{print $3,$5,$8,$9,$11}' >"/eph/partial_scores/${QUERY_ID}/${PROTEIN_ID}.txt" | ||
} | ||
export -f compute_scores | ||
|
||
# Sets the default number of parallel jobs if not specified otherwise | ||
if [[ -z "${N_PARALLEL_JOBS}" ]]; then | ||
N_PARALLEL_JOBS=20 | ||
fi | ||
|
||
# Run at most N_PARALLEL_JOBS jobs in parallel | ||
parallel --jobs "${N_PARALLEL_JOBS}" "compute_scores ${QUERY_ID} {}" ::: "${DATASET_IDS[@]}" | ||
|
||
N_DATASET_PROTEINS="${#DATASET_IDS[@]}" | ||
|
||
# Merge results | ||
for ((i = 0; i < N_DATASET_PROTEINS; i++)); do | ||
PROTEIN_ID="${DATASET_IDS[i]}" | ||
|
||
if [[ "${CACHE_RESULT}" == "True" ]]; then | ||
echo -n "${PROTEIN_ID} " | tee -a "${RESULTS_FOLDER}/${QUERY_ID}-limit=${LIMIT}-offset=${OFFSET}.txt" | ||
tee -a "${RESULTS_FOLDER}/${QUERY_ID}-limit=${LIMIT}-offset=${OFFSET}.txt" <"/eph/partial_scores/${QUERY_ID}/${PROTEIN_ID}.txt" | ||
else | ||
echo -n "${PROTEIN_ID} " | ||
cat "/eph/partial_scores/${QUERY_ID}/${PROTEIN_ID}.txt" | ||
fi | ||
done | ||
|
||
# Remove temporary files | ||
rm -r "/eph/partial_scores/${QUERY_ID}" | ||
rm "/eph/initiated_computations/${QUERY_ID}" |
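
The data flow of the script above (split the ID list, keep five columns from the last output line, print one `<id> <fields>` line per protein) can be tried without USalign or GNU Parallel. The sketch below is an illustration under stated assumptions, not the project's code: `fake_usalign` is a hypothetical stand-in that prints placeholder columns `c1`..`c11` rather than real scores, the helper names are invented here, and the list is split with `tr` instead of the script's `read -ra` array so that it also runs under plain `sh`. It does not assert what the selected columns mean.

```shell
#!/bin/sh

# Hypothetical stand-in for `USalign ... -outfmt 2`: prints a header line and
# one tab-separated row of 11 placeholder columns. These are not real scores.
fake_usalign() {
    printf '#header\n'
    printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\n'
}

# Keep only the last output line and the same five columns the script selects.
extract_fields() {
    tail -n 1 | awk '{print $3,$5,$8,$9,$11}'
}

# Mirror the cache file naming used in the script's merge step.
result_path() {
    printf '/eph/results/%s-limit=%s-offset=%s.txt\n' "$1" "$2" "$3"
}

# Split a comma-separated ID list and print "<id> <fields>" per ID, in input
# order. The real script computes the fields in parallel and merges afterwards.
merge_scores() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r id; do
        printf '%s ' "${id}"
        fake_usalign | extract_fields
    done
}
```

Calling `merge_scores "A,B"` prints one line per ID, each holding the ID followed by the five selected placeholder columns. `result_path Q1 50 0` shows where the script would cache that output when `CACHE_RESULT` is `True`.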
Empty file.