Added AlphaFind
ProchazkaDavid committed Jan 30, 2024
1 parent f5568ac commit 5fa6d27
Showing 249 changed files with 329,228 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
# Set the default behavior for all files
* text=auto

# Explicitly declare files that will always have Unix-style line endings
*.sh text eol=lf
*.py text eol=lf
39 changes: 39 additions & 0 deletions README.md
@@ -0,0 +1,39 @@
<div align="center">
<br>
<br>
<a href="https://github.com/Coda-Research-Group/AlphaFind"><img src="https://raw.githubusercontent.com/Coda-Research-Group/AlphaFind/main/static/logo.png" alt="AlphaFind" width="220"></a>
<br>
<br>
</div>

# AlphaFind: Discover structure similarity across the entire known proteome

**[AlphaFind](https://alphafind.fi.muni.cz)** is a web-based search engine for structure-based search of the entire [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk). It accepts a UniProt ID, PDB ID, or gene symbol as input and returns the most similar proteins found in AlphaFold DB, with an option to run an additional search that extends and refines the results. The search results are grouped by source organism and displayed along with several similarity metrics. 3D visualizations of the structural superposition of the proteins are provided, and text filters can be used to find specific organisms or UniProt IDs. For details about the methodology and usage, please see the [manual](https://github.com/Coda-Research-Group/AlphaFind/wiki/Manual). The website is free and open to all users, with no login requirement.

Vector embeddings and model weights used in [AlphaFind](https://alphafind.fi.muni.cz) are available at [AlphaFind: Discover structure similarity across the entire known proteome – data and model | Czech national repository](https://data.narodni-repozitar.cz/general/datasets/d35zf-1ja47).
This project uses [USalign](https://github.com/pylelab/USalign).

## Code Structure

The codebase is divided into three folders:
- `training` (model training, index building)
- `api` (backend)
- `ui` (frontend)

See the `README.md` files in each folder for more details.

## Running locally

Prerequisites:
- [Docker](https://docs.docker.com/get-docker/)

1. Clone this repository
2. Run `./run.sh` in your terminal
3. Open `http://localhost:8081` in your browser

The `training/data/cifs` folder contains a small subset of the AlphaFold DB comprising 109 proteins.
The full AlphaFold DB can be downloaded from [here](https://alphafold.ebi.ac.uk/download).

## License

MIT license
4 changes: 4 additions & 0 deletions api/.flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 120
max-complexity = 7
extend-ignore = E203
12 changes: 12 additions & 0 deletions api/.gitignore
@@ -0,0 +1,12 @@
__pycache__/
.ipynb_checkpoints/
wandb/
.idea/
config_.yaml
pod_.yaml
secret.yaml
kubectl
*.h5
data/
eph/*/
models/
18 changes: 18 additions & 0 deletions api/.pre-commit-config.yaml
@@ -0,0 +1,18 @@
repos:
  - repo: https://github.com/psf/black
    rev: 23.7.0
    hooks:
      - id: black
        args:
          - --check
          - --diff
  - repo: https://github.com/PyCQA/isort
    rev: 5.12.0
    hooks:
      - id: isort
        args:
          - --check-only
  - repo: https://github.com/PyCQA/flake8
    rev: 6.1.0
    hooks:
      - id: flake8
45 changes: 45 additions & 0 deletions api/README.md
@@ -0,0 +1,45 @@
# AlphaFind API

[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)

This project uses [USalign](https://github.com/pylelab/USalign).
Vector embeddings and model weights used in [AlphaFind](https://alphafind.fi.muni.cz) are available at [AlphaFind: Discover structure similarity across the entire known proteome – data and model | Czech national repository](https://data.narodni-repozitar.cz/general/datasets/d35zf-1ja47).

## Running locally

1. Make the `data` and `models` folders from `alphafind-training` available in the root of this repository, for example by symlinking them:

```shell
ln -s ../alphafind-training/models/ models/
ln -s ../alphafind-training/data/ data/
```

2. Run the following commands:

```shell
# Build the server image
docker build -t alphafind:server -f ./server/Dockerfile .

# Run the server
# Note: on Windows you may need to use absolute paths instead of relative paths.
docker run -p 8080:8000 \
    -v ./data:/data \
    -v ./models:/models \
    -v ./eph:/eph \
    alphafind:server

# Example query (run from a second terminal while the server is running)
curl 'http://localhost:8080/search?query=A0A0C5PVI1'
```
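
The same query can also be issued from Python. The snippet below is a minimal sketch using only the standard library; the helper names `build_search_url` and `search` are hypothetical, and since the response format is not described here, the body is returned as raw text rather than parsed.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Host port published by `docker run -p 8080:8000` above.
BASE_URL = "http://localhost:8080"


def build_search_url(query: str, base_url: str = BASE_URL) -> str:
    """Return the /search URL for a query identifier such as a UniProt ID."""
    return f"{base_url}/search?{urlencode({'query': query})}"


def search(query: str, base_url: str = BASE_URL, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text.

    The response schema is an unknown here, so no parsing is attempted.
    """
    with urlopen(build_search_url(query, base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, `print(search("A0A0C5PVI1"))` should be equivalent to the `curl` call above.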

## Installing dependencies

```shell
# Production environment
pip install -r requirements.txt

# Development environment
pip install -r requirements-dev.txt
pre-commit install
```
Binary file added api/USalign
Binary file not shown.
15 changes: 15 additions & 0 deletions api/USalign_LICENSE
@@ -0,0 +1,15 @@
US-align: universal structure alignment of monomeric and complex proteins
and nucleic acids

References to cite:
(1) Chengxin Zhang, Morgan Shine, Anna Marie Pyle, Yang Zhang
(2022) Nat Methods. 19(9), 1109-1115.
(2) Chengxin Zhang, Anna Marie Pyle (2022) iScience. 25(10), 105218.

DISCLAIMER:
Permission to use, copy, modify, and distribute this program for
any purpose, with or without fee, is hereby granted, provided that
the notices on the head, the reference information, and this
copyright notice appear in all copies or substantial portions of
the Software. It is provided "as is" without express or implied
warranty.
64 changes: 64 additions & 0 deletions api/compute.sh
@@ -0,0 +1,64 @@
#!/bin/bash

QUERY_ID=$1
# <protein_id>,<protein_id>,...
PROTEINS=$2
# True/False
CACHE_RESULT=$3
# How many proteins we're computing scores for
LIMIT=$4
# Which protein to start from
OFFSET=$5

RESULTS_FOLDER='/eph/results'

mkdir -p "${RESULTS_FOLDER}"
mkdir -p /eph/initiated_computations

# Mark that this process is computing the query
touch "/eph/initiated_computations/${QUERY_ID}"

# Create temporary folder
mkdir -p "/eph/partial_scores/${QUERY_ID}"

# Extract database proteins
IFS=',' read -ra DATASET_IDS <<<"${PROTEINS}"

# Compute scores for a single query protein and a single database protein
compute_scores() {
    # Use local variables so the function does not overwrite the script's globals
    local query_id=$1
    local protein_id=$2

    local query_protein_path="/data/cifs/AF-${query_id}-F1-model_v3.cif"
    local dataset_protein_path="/data/cifs/AF-${protein_id}-F1-model_v3.cif"

    # Keep columns 3, 5, 8, 9 and 11 from the last line of USalign's tabular output
    /home/alphafind/USalign "${query_protein_path}" "${dataset_protein_path}" -outfmt 2 |
        tail -n 1 |
        awk -F ' ' '{print $3,$5,$8,$9,$11}' >"/eph/partial_scores/${query_id}/${protein_id}.txt"
}
export -f compute_scores

# Default to 20 parallel jobs when N_PARALLEL_JOBS is unset or empty
N_PARALLEL_JOBS="${N_PARALLEL_JOBS:-20}"

# Run at most N_PARALLEL_JOBS jobs in parallel
parallel --jobs "${N_PARALLEL_JOBS}" "compute_scores ${QUERY_ID} {}" ::: "${DATASET_IDS[@]}"

N_DATASET_PROTEINS="${#DATASET_IDS[@]}"

# Merge results
for ((i = 0; i < N_DATASET_PROTEINS; i++)); do
    PROTEIN_ID="${DATASET_IDS[i]}"

    if [[ "${CACHE_RESULT}" == "True" ]]; then
        echo -n "${PROTEIN_ID} " | tee -a "${RESULTS_FOLDER}/${QUERY_ID}-limit=${LIMIT}-offset=${OFFSET}.txt"
        tee -a "${RESULTS_FOLDER}/${QUERY_ID}-limit=${LIMIT}-offset=${OFFSET}.txt" <"/eph/partial_scores/${QUERY_ID}/${PROTEIN_ID}.txt"
    else
        echo -n "${PROTEIN_ID} "
        cat "/eph/partial_scores/${QUERY_ID}/${PROTEIN_ID}.txt"
    fi
done

# Remove temporary files
rm -r "/eph/partial_scores/${QUERY_ID}"
rm "/eph/initiated_computations/${QUERY_ID}"
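
Note on the output of `compute.sh`: each merged line consists of the database protein ID followed by the five values that `awk` selects from USalign's `-outfmt 2` output. The Python sketch below shows how a consumer might parse such a line and reconstruct the cache file name. The field names are an assumption: they presume that USalign's tabular columns are `PDBchain1 PDBchain2 TM1 TM2 RMSD ID1 ID2 IDali L1 L2 Lali`, so that columns 3, 5, 8, 9 and 11 are TM1, RMSD, IDali, L1 and Lali. The names `ScoreLine`, `parse_score_line`, and `cache_file_name` are hypothetical and do not come from this commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreLine:
    protein_id: str
    tm_score: float          # assumed: column 3 (TM1)
    rmsd: float              # assumed: column 5 (RMSD)
    aligned_identity: float  # assumed: column 8 (IDali)
    query_length: int        # assumed: column 9 (L1)
    aligned_length: int      # assumed: column 11 (Lali)


def parse_score_line(line: str) -> ScoreLine:
    """Parse one merged line: '<protein_id> <v1> <v2> <v3> <v4> <v5>'."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 whitespace-separated fields, got {len(parts)}")
    return ScoreLine(
        protein_id=parts[0],
        tm_score=float(parts[1]),
        rmsd=float(parts[2]),
        aligned_identity=float(parts[3]),
        query_length=int(parts[4]),
        aligned_length=int(parts[5]),
    )


def cache_file_name(query_id: str, limit: int, offset: int) -> str:
    """Mirror the file name the script writes under /eph/results when caching."""
    return f"{query_id}-limit={limit}-offset={offset}.txt"
```

A line with a missing or failed alignment would have fewer than six fields, which this sketch rejects with a `ValueError` instead of guessing at the values.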
Empty file added api/eph/.gitkeep
Empty file.