From 14e1f4eeb9c6c64731ad41f999b7d6125877c83f Mon Sep 17 00:00:00 2001
From: evelynmitchell
Date: Sun, 12 May 2024 20:12:56 -0600
Subject: [PATCH] Update README with refs from meeting 20240512

---
 README.md | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 76 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index ea29e39..cf5a127 100644
--- a/README.md
+++ b/README.md
@@ -119,47 +119,83 @@ print(output.shape)
 # Notes
 -> pairwise representation
 -> explicit atomic positions
+
 -> within the trunk, MSA processing is de-emphasized with a simpler MSA block, 4 blocks
+
 -> MSA processing -> pair weighted averaging
+
 -> pairformer: replaces evoformer, operates on pair representation and single representation
+
 -> pairformer 48 blocks
+
 -> pair and single representation together with the input representation are passed to the diffusion module
+
 -> diffusion takes in 3 tensors [pair, single representation, with new pairformer representation]
+
 -> diffusion module operates directly on raw atom coordinates
+
 -> standard diffusion approach, model is trained to receive noised atomic coordinates then predict the true coordinates
+
 -> the network learns protein structure at a variety of length scales: the denoising task at small noise emphasizes local stereochemistry, and at high noise it emphasizes the large-scale structure of the system.
+
 -> at inference time, random noise is sampled and then recurrently denoised to produce a final structure
+
 -> diffusion module produces a distribution of answers
+
 -> for each answer the local structure will be sharply defined
+
 -> diffusion models are prone to hallucination, where the model may invent plausible-looking structures
+
 -> to counteract hallucination, they use a novel cross-distillation method where they enrich the training data with AlphaFold-Multimer v2.3 predicted structures.
+
 -> confidence measures predict the atom-level and pairwise errors in final structures; this is done by regressing the error in the output of the structure module in training,
--> Utilizes diffusion rollout procedure for the full structure generation during training ( using a largeer step suze than normal)
+
+-> Utilizes diffusion rollout procedure for the full structure generation during training (using a larger step size than normal)
+
 -> the rollout-predicted structure is used to permute the symmetric ground-truth chains and ligands and to compute the metrics used to train the confidence head.
+
 -> confidence head uses the pairwise representation to predict the lDDT (pLDDT) and a predicted aligned error (PAE) matrix as used in AlphaFold 2, as well as a distance error matrix, which is the error in the distance matrix of the predicted structure as compared to the true structure
+
 -> confidence measures also predict atom-level and pairwise errors
+
 -> early stopping using a weighted average of all the above metrics
+
 -> AF3 can predict structures from input polymer sequences, residue modifications, and ligand SMILES
+
 -> uses structures below 1000 residues
+
 -> AlphaFold 3 is able to predict protein-nucleic acid structures with thousands of residues
+
 -> Covalent modifications (bonded ligands, glycosylation, and modified protein residues and nucleic acid bases) are also accurately predicted by AF3
+
 -> distills AlphaFold 2 predictions
+
 -> key problem in protein structure prediction is that models predict static structures and not the dynamical behavior
+
 -> multiple random seeds for either the diffusion head or network do not produce an approximation of the solution ensemble
+
 -> in future: generate a large number of predictions and rank them
+
 -> inference: top confidence sample from 5 seed runs and 5 diffusion samples per model seed for a total of 25 samples
+
 -> interface accuracy via interface lDDT, which is calculated from distances between atoms across different chains in the interface
+
 -> uses an lDDT-to-polymer metric which considers differences from each atom of an entity to any Cα or C1′ polymer atom within a radius
 # Todo
 ## Model Architecture
-- Implement input Embedder from Alphafold2 openfold implementation [LINK](https://github.com/aqlaboratory/openfold)
+- Implement input Embedder from Alphafold2 openfold
+implementation [LINK](https://github.com/aqlaboratory/openfold)
+
 - Implement the template module from openfold [LINK](https://github.com/aqlaboratory/openfold)
+
 - Implement the MSA embedding from openfold [LINK](https://github.com/aqlaboratory/openfold)
+
 - Fix residuals and make sure pair representation and generated output goes into the diffusion model
+
 - Implement recycling to fix residuals
@@ -169,4 +205,41 @@ print(output.shape)
 # Resources
 - [ EvoFormer Paper ](https://www.nature.com/articles/s41586-021-03819-2)
 - [ Pairformer](https://arxiv.org/pdf/2311.03583)
-- [ AlphaFold 3 Paper](https://www.nature.com/articles/s41586-024-07487-w)
\ No newline at end of file
+- [ AlphaFold 3 Paper](https://www.nature.com/articles/s41586-024-07487-w)
+
+- [OpenFold](https://github.com/aqlaboratory/openfold)
+
+
+## Datasets
+Smaller, start here
+- [Protein data bank](https://www.rcsb.org/)
+- [Working with pdb data](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/dealing-with-coordinates)
+- [PDB ligands](https://huggingface.co/datasets/jglaser/pdb_protein_ligand_complexes)
+
+Much larger, for verification
+- [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/)
+- [Colab notebook for AlphaFold search](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb)
+
+## Benchmarks
+
+- [RoseTTAFold](https://www.biorxiv.org/content/10.1101/2021.08.15.456425v1)(https://www.ipd.uw.edu/2021/07/rosettafold-accurate-protein-structure-prediction-accessible-to-all/0)
+
+## Related Projects
+
+- [NeuroFold](https://www.biorxiv.org/content/10.1101/2024.03.12.584504v1)
+
+## Tools
+
+- [HaskTorch - Haskell bindings for PyTorch](https://github.com/hasktorch/hasktorch)
+
+- [PyMol](https://pymol.org/)
+- [ChimeraX](https://www.cgl.ucsf.edu/chimerax/download.html)
+
+## Community
+
+- [OpenBioML](https://discord.gg/nh3gCv6b)
+
+## Classics
+
+- [Thinking in Systems](https://www.chelseagreen.com/product/thinking-in-systems/)
+
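
The interface lDDT mentioned in the notes can be illustrated with a small, dependency-free sketch. It assumes the commonly used lDDT definition (distance tolerances of 0.5, 1, 2 and 4 Å, and a 15 Å inclusion radius on the reference structure) and, for brevity, averages over all atom pairs rather than per residue. The function name `lddt` and its parameters are hypothetical; they are not taken from this repository or from the AlphaFold 3 code.

```python
import math
from itertools import combinations

# Distance tolerances in angstroms (assumed: the commonly used lDDT levels).
THRESHOLDS = (0.5, 1.0, 2.0, 4.0)


def lddt(true_xyz, pred_xyz, chain_ids=None, inclusion_radius=15.0):
    """Fraction of reference inter-atom distances preserved in the prediction.

    true_xyz, pred_xyz: equal-length lists of (x, y, z) tuples.
    chain_ids: optional per-atom chain labels. When given, only pairs whose
        atoms belong to different chains are scored (interface variant).
    inclusion_radius: only pairs closer than this in the reference count.
    """
    if len(true_xyz) != len(pred_xyz):
        raise ValueError("coordinate lists must have the same length")
    preserved = 0
    total = 0
    for i, j in combinations(range(len(true_xyz)), 2):
        # Interface variant: skip pairs within the same chain.
        if chain_ids is not None and chain_ids[i] == chain_ids[j]:
            continue
        d_true = math.dist(true_xyz[i], true_xyz[j])
        if d_true >= inclusion_radius:
            continue
        d_pred = math.dist(pred_xyz[i], pred_xyz[j])
        err = abs(d_true - d_pred)
        # One point per tolerance level that the distance error stays under.
        preserved += sum(err < t for t in THRESHOLDS)
        total += len(THRESHOLDS)
    return preserved / total if total else float("nan")


# Toy example: three atoms, the third one sits in a different chain.
true_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
pred_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 5.5, 0.0)]
chains = ["A", "A", "B"]

print("global lDDT:", lddt(true_coords, pred_coords))
print("interface lDDT:", lddt(true_coords, pred_coords, chain_ids=chains))
```

Passing `chain_ids` restricts the score to cross-chain pairs, which is the only difference between the global and the interface variant in this sketch. A production metric would also handle per-residue averaging and symmetric-chain permutations.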