Merge pull request #4 from evelynmitchell/main

kyegomez · May 13, 2024 · 2c6ebd9 · 2c6ebd9
2 parents df48b76 + 14e1f4e
commit 2c6ebd9
Showing 1 changed file with 76 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -119,47 +119,83 @@ print(output.shape)
 
 # Notes
 -> pairwise representation -> explicit atomic positions
+
 -> within the trunk, msa processing is de emphasized with a simpler MSA block, 4 blocks
+
 -> msa processing -> pair weighted averaging 
+
 -> pairformer: replaces evoformer, operates on pair representation and single representation
+
 -> pairformer 48 blocks
+
 -> pair and single representation together with the input representation are passed to the diffusion module
+
 -> diffusion takes in 3 tensors [pair, single representation, with new pairformer representation]
+
 -> diffusion module operates directory on raw atom coordinates
+
 -> standard diffusion approach, model is trained to receiev noised atomic coordinates then predict the true coordinates
+
 -> the network learns protein structure at a variety of length scales where the denoising task at small noise emphasizes large scale structure of the system.
+
 -> at inference time, random noise is sampled and then recurrently denoised to produce a final structure
+
 -> diffusion module produces a distribution of answers
+
 -> for each answer the local structure will be sharply defined
+
 -> diffusion models are prone to hallucination where the model may hallucinate plausible looking structures
+
 -> to counteract hallucination, they use a novel cross distillation method where they enrich the training data with alphafold multimer v2.3 predicted strutctures. 
+
 -> confidence measures predicts the atom level and pairwise errors in final structures, this is done by regressing the error in the outut of the structure mdule in training,
--> Utilizes diffusion rollout procedure for the full structure generation during training ( using a largeer step suze than normal)
+
+-> Utilizes diffusion rollout procedure for the full structure generation during training ( using a larger step suze than normal)
+
 -> diffused predicted structure is used to permute the ground truth and ligands to compute metrics to train the confidence head.
+
 -> confidence head uses the pairwise representation to predict the lddt (pddt) and a predicted aligned error matrix as used in alphafold 2 as well as distance error matrix which is the error in the distance matrix of the predicted structure as compared to the true structure
+
 -> confidence measures also preduct atom level and pairwise errors
+
 -> early stopping using a weighted average of all above metic
+
 -> af3 can predict srtructures from input polymer sequences, rediue modifications, ligand smiles
+
 -> uses structures below 1000 residues
+
 -> alphafold3 is able to predict protein nuclear structures with thousnads of residues
+
 -> Covalent modifications (bonded ligands, glycosylation, and modified protein residues and
 202 nucleic acid bases) are also accurately predicted by AF
+
 -> distills alphafold2 preductions
+
 -> key problem in protein structure prediction is they predict static structures and not the dynamical behavior
+
 -> multiple random seeds for either the diffusion head or network does not product an approximation of the solution ensenble
+
 -> in future: generate large number of predictions and rank them
+
 -> inference: top confidence sample from 5 seed runs and 5 diffusion samples per model seed for a total of 25 samples
+
 -> interface accuracy via interface lddt which is calculated from distances netween atoms across different chains in the interface
+
 -> uses a lddt to polymer metric which considers differences from each atom of a entity to any c or c1 polymer atom within  aradius
 
 
 # Todo
 
 ## Model Architecture
-- Implement input Embedder from Alphafold2 openfold implementation [LINK](https://github.com/aqlaboratory/openfold)
+- Implement input Embedder from Alphafold2 openfold 
+implementation [LINK](https://github.com/aqlaboratory/openfold)
+
 - Implement the template module from openfold [LINK](https://github.com/aqlaboratory/openfold)
+
 - Implement the MSA embedding from openfold [LINK](https://github.com/aqlaboratory/openfold)
+
 - Fix residuals and make sure pair representation and generated output goes into the diffusion model
+
 - Implement reclying to fix residuals
 
 
@@ -169,4 +205,41 @@ print(output.shape)
 # Resources
 - [ EvoFormer Paper ](https://www.nature.com/articles/s41586-021-03819-2)
 - [ Pairformer](https://arxiv.org/pdf/2311.03583)
-- [ AlphaFold 3 Paper](https://www.nature.com/articles/s41586-024-07487-w)
+- [ AlphaFold 3 Paper](https://www.nature.com/articles/s41586-024-07487-w)
+
+- [OpenFold](https://github.com/aqlaboratory/openfold)
+
+
+## Datasets
+Smaller, start here
+- [Protein data bank](https://www.rcsb.org/)
+- [Working with pdb data](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/dealing-with-coordinates)
+- [PDB ligands](https://huggingface.co/datasets/jglaser/pdb_protein_ligand_complexes)
+
+Much larger, for verification
+- [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/)
+- [Colab notebook for AlphaFold search](https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb)
+
+## Benchmarks
+
+- [RoseTTAFold](https://www.biorxiv.org/content/10.1101/2021.08.15.456425v1)(https://www.ipd.uw.edu/2021/07/rosettafold-accurate-protein-structure-prediction-accessible-to-all/0)
+
+## Related Projects
+
+- [NeuroFold](https://www.biorxiv.org/content/10.1101/2024.03.12.584504v1)
+
+## Tools
+
+- [HaskTorch - Haskell bindings for PyTorch](https://github.com/hasktorch/hasktorch)
+
+- [PyMol](https://pymol.org/)
+- [ChimeraX](https://www.cgl.ucsf.edu/chimerax/download.html)
+
+## Community
+
+- [OpenBioML](https://discord.gg/nh3gCv6b)
+
+## Classics 
+
+- [Thinking in Systems](https://www.chelseagreen.com/product/thinking-in-systems/)
+