scripts to prepare denali dataset for training #118
This pull request introduces 4 alerts when merging 1be8860 into 4c769be - view on LGTM.com new alerts:
@jeherr : Do you know how many molecules are filtered out by this? I wouldn't expect the SMILES from ChEMBL to match the SMILES from OEChem for multiple reasons:
I think the correct approach here is to ignore the ChEMBL SMILES (though you can probably keep that as metadata) and instead just check that all the snapshots for what should be a single protonation/tautomeric species have the same OpenEye perceived SMILES. That is sufficient to guarantee consistency. We can then label the molecule using the Open Force Field canonical isomeric tagged SMILES with
# Convert to OpenFF Molecule
from openff.toolkit.topology import Molecule
offmol = Molecule.from_openeye(oemol)
# Generate tagged canonical isomeric SMILES with explicit hydrogens so we can reconstruct topology
tagged_smiles = offmol.to_smiles(isomeric=True, explicit_hydrogens=True, mapped=True)
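The consistency check described above could be sketched roughly as follows. This is only an illustration: the function name `find_inconsistent_molecules` and the molecule IDs are hypothetical, and it assumes the per-snapshot SMILES strings have already been perceived and canonicalized (for example by OpenEye), so plain string comparison is enough.

```python
from collections import defaultdict


def find_inconsistent_molecules(snapshot_smiles):
    """Return molecules whose snapshots do not all share one SMILES.

    snapshot_smiles: iterable of (molecule_id, smiles) pairs, one per snapshot.
    Assumes the SMILES are already canonical, so string equality suffices.
    """
    seen = defaultdict(set)
    for mol_id, smiles in snapshot_smiles:
        seen[mol_id].add(smiles)
    # Keep only molecules that produced more than one distinct SMILES
    return {mol_id: sorted(s) for mol_id, s in seen.items() if len(s) > 1}


# Hypothetical input: mol_a is consistent, mol_b changes between snapshots
snapshots = [
    ("mol_a", "CCO"),
    ("mol_a", "CCO"),
    ("mol_b", "CC=O"),
    ("mol_b", "C=CO"),
]
inconsistent = find_inconsistent_molecules(snapshots)
print(inconsistent)
```

Molecules that come back from this check would be the ones to drop or inspect; the ChEMBL SMILES can still be carried along as metadata without being part of the comparison.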
print([(atom.GetAtomicNum(), atom.GetValence()) for atom in tmp_mol.GetAtoms()])
oechem.OEFindRingAtomsAndBonds(tmp_mol)
print([(atom.GetAtomicNum(), atom.GetValence()) for atom in tmp_mol.GetAtoms()])
oechem.OEPerceiveBondOrders(tmp_mol)
Do you need to explicitly perceive aromaticity after this?
Matching the ChEMBL SMILES string only filters out about 2,000 unique molecules from the dataset. Roughly 98,000 are left, so it's a pretty insignificant chunk. Even then, a fair number of those are due to a minor bug in parsing the data that I haven't bothered to fix yet and that I don't think is related to matching the SMILES string.

To be clear, at this point I am still only loading the ChEMBL conformers, so there should be no protonated or tautomeric states, correct?

I don't perceive aromaticity explicitly, but I believe that is the default behavior of OEFindRingAtomsAndBonds? I could be wrong. I'm not sure whether it matters here, unless it changes the graph used by espaloma. At a minimum it doesn't matter for matching the ChEMBL SMILES to the coordinates, since the vast majority of molecules aren't filtered out by this.
Just another update to this. If I train on the denali dataset without running It seems like something is off with the nonbonded cutoff and this data. Maybe I should examine some cases and see what is contributing to large discrepancies in the energies for different snapshots of the same molecule?
I've started looking at single cases to see what is happening. I've put two snapshots of the same molecule which have drastically different energies after subtracting the non-bonded forces below.
And so we can inspect which atoms are experiencing large non-bonded forces, I'll paste them below.
And just to make it easier to discuss here, I'll attach a screenshot of each molecule.
The non-bonded energies for the two snapshots are 102.526 kcal/mol and 228050.667 kcal/mol respectively. Obviously something is wrong here! The large forces are located on atoms towards the end of the tail coming off of the ring, primarily the two terminating carbons, the sulfur, the oxygens, and a few of the hydrogens. It appears that rotation around the carbonyl is causing the large change in structure, which leads to the drastic change in non-bonded energies.
Just to add, there is no obvious reason to me why the energy of the second snapshot should be drastically higher than the first snapshot at the QM level, so my intuition is that something is not well described by the force field.
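One possible way to narrow this down, sketched below, is to list the non-bonded atom pairs that sit unusually close together in each snapshot, since a very short contact is one plausible (but unconfirmed) source of an enormous non-bonded energy. The function name `close_contacts`, the 1.5 cutoff, and the toy coordinates are all assumptions for illustration; bonded pairs have to be passed in as exclusions because they are short by construction.

```python
import math
from itertools import combinations


def close_contacts(coords, cutoff=1.5, excluded_pairs=()):
    """List atom pairs closer than `cutoff`, shortest distance first.

    coords: sequence of (x, y, z) tuples, all in the same length unit as cutoff.
    excluded_pairs: index pairs to skip, e.g. directly bonded atoms.
    """
    excluded = {frozenset(pair) for pair in excluded_pairs}
    contacts = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if frozenset((i, j)) in excluded:
            continue
        distance = math.dist(a, b)
        if distance < cutoff:
            contacts.append((i, j, distance))
    return sorted(contacts, key=lambda contact: contact[2])


# Toy coordinates (not from the dataset): atoms 1 and 2 nearly overlap
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.2, 0.0, 0.0), (5.0, 0.0, 0.0)]
contacts = close_contacts(coords, cutoff=1.5, excluded_pairs=[(0, 1)])
for i, j, distance in contacts:
    print(f"atoms {i} and {j}: {distance:.3f}")
```

Running this on both snapshots and comparing the lists would show whether the rotated tail brings any of the flagged atoms into contact, or whether the discrepancy has to come from somewhere else in the force field evaluation.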
After downloading the OrbNet Denali dataset, I first run make_chembl_dict.py, which grabs all the necessary information from the files provided by the OrbNet people.

Then match_smiles.py iterates over the dictionary built by the first script and retrieves the canonical SMILES from ChEMBL corresponding to the ChEMBL ID provided with the dataset. It then attempts to build a graph from the coordinates of the lowest-energy snapshot for that molecule. If the SMILES from ChEMBL does not match the SMILES that OEChem determines from that graph, the molecule is ignored. After that, every snapshot is checked to confirm it returns the same SMILES string as the minimum-energy snapshot; only snapshots which do not match are thrown away. The script also filters out snapshots that are more than 0.1 Hartree higher in energy than the minimum-energy snapshot for each molecule.

Finally, transform.py is a small script which subtracts the nonbonded forces from the reference energies. It again filters out any snapshots that are more than 0.1 Hartree higher in energy than the minimum-energy snapshot. After the subtraction, the energies of most snapshots are vastly different, so the vast majority of the dataset is filtered out by this last script.
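To illustrate why the second pass of the 0.1 Hartree filter removes so much, here is a minimal sketch of the window filter applied before and after subtracting a non-bonded term. It is not the code in transform.py: the function names and the QM energies are hypothetical, and only the two non-bonded energies (102.526 and 228050.667 kcal/mol) are taken from the discussion in this thread.

```python
HARTREE_TO_KCAL_PER_MOL = 627.509474  # standard conversion factor


def filter_by_energy_window(energies, window=0.1):
    """Return indices of snapshots within `window` Hartree of the minimum."""
    e_min = min(energies)
    return [i for i, e in enumerate(energies) if e - e_min <= window]


def subtract_nonbonded(qm_energies, nonbonded_kcal):
    """Subtract non-bonded energies (kcal/mol) from QM energies (Hartree)."""
    return [
        e - nb / HARTREE_TO_KCAL_PER_MOL
        for e, nb in zip(qm_energies, nonbonded_kcal)
    ]


# Hypothetical QM energies in Hartree for two snapshots of one molecule
qm_energies = [-1000.00, -999.98]
# Non-bonded energies in kcal/mol quoted in this thread
nonbonded_kcal = [102.526, 228050.667]

kept_before = filter_by_energy_window(qm_energies)
residual = subtract_nonbonded(qm_energies, nonbonded_kcal)
kept_after = filter_by_energy_window(residual)

print("kept before subtraction:", kept_before)
print("kept after subtraction:", kept_after)
```

With these numbers the two non-bonded terms differ by several hundred Hartree, so after subtraction the snapshot with the huge non-bonded energy becomes the new minimum and the other snapshot falls far outside the 0.1 Hartree window.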