Commit

Merge pull request #4 from evelynmitchell/main
kyegomez authored May 13, 2024
2 parents df48b76 + 14e1f4e commit 2c6ebd9
Showing 1 changed file with 76 additions and 3 deletions.
79 changes: 76 additions & 3 deletions README.md
@@ -119,47 +119,83 @@ print(output.shape)

# Notes
-> pairwise representation -> explicit atomic positions

-> within the trunk, MSA processing is de-emphasized with a simpler MSA block, 4 blocks

-> MSA processing -> pair weighted averaging

-> pairformer: replaces evoformer, operates on pair representation and single representation

-> pairformer 48 blocks

-> pair and single representation together with the input representation are passed to the diffusion module

-> diffusion takes in 3 tensors [pair, single representation, with new pairformer representation]

-> diffusion module operates directly on raw atom coordinates

-> standard diffusion approach: the model is trained to receive noised atomic coordinates and then predict the true coordinates

-> the network learns protein structure at a variety of length scales: the denoising task at small noise emphasizes very local stereochemistry, while the denoising task at high noise emphasizes large-scale structure of the system.

-> at inference time, random noise is sampled and then recurrently denoised to produce a final structure

-> diffusion module produces a distribution of answers

-> for each answer the local structure will be sharply defined

-> diffusion models are prone to hallucination where the model may hallucinate plausible looking structures

-> to counteract hallucination, they use a novel cross-distillation method in which they enrich the training data with structures predicted by AlphaFold-Multimer v2.3.
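
The diffusion notes above (train on noised coordinates, predict the true ones, then denoise step by step at inference) can be illustrated with a small toy. This is a minimal sketch in plain Python, not the AF3 diffusion module: `denoiser` is a hypothetical callable standing in for the trained network, the noise schedule is made up, and the deterministic update in `sample` is a simplification, not the sampler described in the paper.

```python
import random


def add_noise(coords, sigma, rng):
    """Add isotropic Gaussian noise with standard deviation sigma to every atom."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in atom) for atom in coords]


def mse(pred, true):
    """Mean squared error over all coordinates."""
    total = sum((p - t) ** 2 for a, b in zip(pred, true) for p, t in zip(a, b))
    return total / (3 * len(true))


def training_loss(denoiser, true_coords, sigma, rng):
    """One training step: noise the true coordinates, ask the model for the clean ones."""
    noised = add_noise(true_coords, sigma, rng)
    predicted = denoiser(noised, sigma)  # hypothetical model call
    return mse(predicted, true_coords)


def sample(denoiser, n_atoms, sigmas, rng):
    """Start from pure noise and repeatedly denoise along a decreasing schedule."""
    x = [tuple(rng.gauss(0.0, sigmas[0]) for _ in range(3)) for _ in range(n_atoms)]
    for sigma, next_sigma in zip(sigmas, list(sigmas[1:]) + [0.0]):
        x0_hat = denoiser(x, sigma)
        # Move from noise level sigma to next_sigma along the line towards x0_hat.
        ratio = next_sigma / sigma
        x = [
            tuple(h + ratio * (xi - h) for xi, h in zip(atom, atom_hat))
            for atom, atom_hat in zip(x, x0_hat)
        ]
    return x
```

With a perfect (oracle) denoiser the training loss is zero and sampling lands exactly on the target. A real network only approximates this, which is why different random seeds give a distribution of answers.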

-> confidence measures predict the atom-level and pairwise errors in final structures; this is done by regressing the error in the output of the structure module during training

-> utilizes a diffusion rollout procedure for full structure generation during training (using a larger step size than normal)

-> the structure predicted by the diffusion rollout is used to permute the ground-truth chains and ligands and to compute the metrics used to train the confidence head.

-> confidence head uses the pairwise representation to predict the lDDT (pLDDT) and a predicted aligned error (PAE) matrix as used in AlphaFold 2, as well as a distance error matrix, which is the error in the distance matrix of the predicted structure compared to the true structure

-> confidence measures also predict atom-level and pairwise errors

-> early stopping using a weighted average of all of the above metrics
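
The distance error matrix mentioned in the confidence notes above can be computed directly from two sets of coordinates. The sketch below shows only the regression target (absolute difference between predicted and true pairwise distances), not the confidence head itself. The function names are illustrative, and treating every entry as a single 3D point is an assumption made for brevity.

```python
import math


def distance_matrix(coords):
    """All pairwise Euclidean distances between points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def distance_error_matrix(pred, true):
    """Absolute error between the predicted and true distance matrices."""
    d_pred = distance_matrix(pred)
    d_true = distance_matrix(true)
    return [
        [abs(p - t) for p, t in zip(row_pred, row_true)]
        for row_pred, row_true in zip(d_pred, d_true)
    ]
```

Because it only compares internal distances, this quantity does not change when the predicted structure is translated or rotated, so no alignment to the true structure is needed.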

-> AF3 can predict structures from input polymer sequences, residue modifications, and ligand SMILES

-> uses structures below 1000 residues

-> AlphaFold 3 is able to predict protein-nucleic acid structures with thousands of residues

-> Covalent modifications (bonded ligands, glycosylation, and modified protein residues and nucleic acid bases) are also accurately predicted by AF3

-> distills AlphaFold 2 predictions

-> a key limitation of protein structure prediction models is that they predict static structures and not the dynamical behavior

-> multiple random seeds for either the diffusion head or the network do not produce an approximation of the solution ensemble

-> in future: generate large number of predictions and rank them

-> inference: top confidence sample from 5 seed runs and 5 diffusion samples per model seed for a total of 25 samples
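
The 5 x 5 sampling and ranking scheme in the note above amounts to a simple loop. In this sketch, `run_sample` is a hypothetical callable that returns a `(structure, confidence)` pair for a given model seed and diffusion sample index. How the confidence score is computed is left out.

```python
def pick_top_confidence(run_sample, n_seeds=5, n_diffusion_samples=5):
    """Run every (seed, diffusion sample) combination and keep the most confident one."""
    candidates = []
    for seed in range(n_seeds):
        for sample_idx in range(n_diffusion_samples):
            structure, confidence = run_sample(seed, sample_idx)  # hypothetical call
            candidates.append((confidence, seed, sample_idx, structure))
    # Rank by confidence only; ties keep the earliest candidate.
    best = max(candidates, key=lambda c: c[0])
    return best, len(candidates)
```

With the defaults this evaluates 25 candidates and returns the best one together with the candidate count.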

-> interface accuracy is measured via interface lDDT, which is calculated from distances between atoms across different chains in the interface

-> uses an lDDT-to-polymer metric, which considers differences from each atom of an entity to any Cα or C1′ polymer atom within a radius
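
The interface lDDT described in the notes above can be sketched as a restriction of ordinary lDDT to atom pairs from different chains. The 15 Å inclusion radius and the 0.5/1/2/4 Å thresholds below are the usual lDDT defaults and are assumptions here. The atom selection and radii used in AF3 are not reproduced, and the same-residue exclusion of full lDDT is omitted for brevity.

```python
import math

THRESHOLDS = (0.5, 1.0, 2.0, 4.0)  # standard lDDT tolerances, in angstroms


def lddt(pred, true, chain_ids=None, inclusion_radius=15.0, interface_only=False):
    """Fraction of true pairwise distances preserved in the prediction.

    With interface_only=True, only pairs from different chains are scored.
    Returns None when no pair falls within the inclusion radius.
    """
    if interface_only and chain_ids is None:
        raise ValueError("chain_ids is required when interface_only=True")
    preserved = 0
    total = 0
    for i in range(len(true)):
        for j in range(i + 1, len(true)):
            if interface_only and chain_ids[i] == chain_ids[j]:
                continue
            d_true = math.dist(true[i], true[j])
            if d_true >= inclusion_radius:
                continue
            diff = abs(math.dist(pred[i], pred[j]) - d_true)
            preserved += sum(diff < t for t in THRESHOLDS)
            total += len(THRESHOLDS)
    return preserved / total if total else None
```

For example, if chain B is displaced by 3 Å relative to chain A, the intra-chain distances stay perfect while the cross-chain pairs pass only the 4 Å threshold, so the interface score drops much further than the all-pairs score.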


# Todo

## Model Architecture
- Implement input Embedder from Alphafold2 openfold implementation [LINK](https://github.com/aqlaboratory/openfold)

- Implement the template module from openfold [LINK](https://github.com/aqlaboratory/openfold)

- Implement the MSA embedding from openfold [LINK](https://github.com/aqlaboratory/openfold)

- Fix residuals and make sure pair representation and generated output go into the diffusion model

- Implement recycling to fix residuals


@@ -169,4 +205,41 @@ print(output.shape)
# Resources
- [EvoFormer Paper](https://www.nature.com/articles/s41586-021-03819-2)
- [Pairformer](https://arxiv.org/pdf/2311.03583)
- [AlphaFold 3 Paper](https://www.nature.com/articles/s41586-024-07487-w)

- [OpenFold](https://github.com/aqlaboratory/openfold)


## Datasets
Smaller, start here
- [Protein data bank](https://www.rcsb.org/)
- [Working with pdb data](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/dealing-with-coordinates)
- [PDB ligands](https://huggingface.co/datasets/jglaser/pdb_protein_ligand_complexes)

Much larger, for verification
- [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/)
- [Colab notebook for AlphaFold search](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb)

## Benchmarks

- [RoseTTAFold](https://www.biorxiv.org/content/10.1101/2021.08.15.456425v1) ([announcement](https://www.ipd.uw.edu/2021/07/rosettafold-accurate-protein-structure-prediction-accessible-to-all/0))

## Related Projects

- [NeuroFold](https://www.biorxiv.org/content/10.1101/2024.03.12.584504v1)

## Tools

- [HaskTorch - Haskell bindings for PyTorch](https://github.com/hasktorch/hasktorch)

- [PyMol](https://pymol.org/)
- [ChimeraX](https://www.cgl.ucsf.edu/chimerax/download.html)

## Community

- [OpenBioML](https://discord.gg/nh3gCv6b)

## Classics

- [Thinking in Systems](https://www.chelseagreen.com/product/thinking-in-systems/)
