scripts to prepare denali dataset for training #118

jeherr · 2022-04-29T18:15:07Z

After downloading the orbnet denali dataset, first I run make_chembl_dict.py which grabs all the necessary information from the files provided by the orbnet people.

Then match_smiles.py is a script which iterates over the dictionary built by the first script and retrieves the canonical SMILES from ChEMBL that corresponds to the ChEMBL ID provided with the dataset. It then attempts to build a graph from the coordinates of the lowest energy snapshot for that molecule. If the SMILES from ChEMBL does not match the SMILES from OEChem determined by creating the graph, then we ignore this molecule. After that all snapshots are checked such that they return the same SMILES string as the minimum energy snapshot. Only snapshots which do not match are thrown away. It also filters out snapshots that are more than 0.1 Hartree higher in energy than the minimum energy snapshot for each molecule.

Finally transform.py is a small script which subtracts off nonbonded forces from the reference energies. This again filters any snapshots which are greater than 0.1 Hartree in energy higher than the minimum energy snapshot. The result of this script is vastly different energies for most snapshots, and thus the vast majority of the dataset is filtered by this last script.

lgtm-com · 2022-04-29T18:49:02Z

This pull request introduces 4 alerts when merging 1be8860 into 4c769be - view on LGTM.com

new alerts:

2 for Except block handles 'BaseException'
1 for Redundant comparison
1 for Unused import

jchodera · 2022-04-29T21:09:02Z

Then match_smiles.py is a script which iterates over the dictionary built by the first script and retrieves the canonical SMILES from ChEMBL that corresponds to the ChEMBL ID provided with the dataset. It then attempts to build a graph from the coordinates of the lowest energy snapshot for that molecule. If the SMILES from ChEMBL does not match the SMILES from OEChem determined by creating the graph, then we ignore this molecule.

@jeherr : Do you know how many molecules are filtered out by this? I wouldn't expect the SMILES from ChEMBL to match the SMILES from OEChem for multiple reasons:

SMILES strings are not unique: Multiple strings map to the same molecule. It looks like you do canonicalize the SMILES with OpenEye OEMolToSmiles() to avoid this.
The same real chemical compound in solution may have multiple different canonical forms that involve different protonation or tautomeric states. You'd also have to pick the same canonical protonation/tautomeric form of the molecules to ensure they're the same. It's not clear to me that you're doing this.
Aromaticity models may differ.

I think the correct approach here is to ignore the ChEMBL SMILES (though you can probably keep that as metadata) and instead just check that all the snapshots for what should be a single protonation/tautomeric species have the same OpenEye perceived SMILES. That is sufficient to guarantee consistency. We can then label the molecule using the Open Force Field canonical isomeric tagged smiles with

# Convert to OpenFF Molecule
from openff.toolkit.topology import Molecule
offmol = Molecule.from_openeye(oemol)
# Generate tagged canonical isomeric SMILES with explicit hydrogens so we can reconstruct topology
tagged_smiles = offmol.to_smiles(isomeric=True, explicit_hydrogens=True, mapped=True)

jchodera · 2022-04-29T21:17:59Z

scripts/denali-dataset/match_smiles.py

+        print([(atom.GetAtomicNum(), atom.GetValence()) for atom in tmp_mol.GetAtoms()])
+        oechem.OEFindRingAtomsAndBonds(tmp_mol)
+        print([(atom.GetAtomicNum(), atom.GetValence()) for atom in tmp_mol.GetAtoms()])
+        oechem.OEPerceiveBondOrders(tmp_mol)


Do you need to explicitly perceive aromaticity after this?

jeherr · 2022-04-29T23:58:05Z

Matching the ChEMBL SMILES string only filters about 2000 unique molecules from the dataset. There is something like 98,000 left so it's a pretty insignificant chunk. Even then a fair number of them are due to some minor bug when parsing the data which I don't think is related to matching the SMILES string that I haven't bothered to fix yet.

To be clear, at this point I am still only loading the ChEMBL conformers, so there should be no protonated or tautomeric states, correct?

I don't perceive aromaticity explicitly, but I believe that is the default behavior of OEFindRingAtomsAndBonds? I could be wrong. I'm not sure if it matters here or not, unless this will change the graph used by espaloma? At a minimum it doesn't matter for matching the ChEMBL SMILES to the coordinates since the vast majority of molecules aren't filtered out by this.

codecov-commenter · 2022-04-30T00:36:07Z

Codecov Report

Merging #118 (4c769be) into master (4c769be) will not change coverage.
The diff coverage is n/a.

❗ Current head 4c769be differs from pull request most recent head 1be8860. Consider uploading reports for the commit 1be8860 to get more accurate results

jeherr · 2022-04-30T01:39:48Z

Just another update to this. If I train on the denali dataset without running subtract_nonbonded_cutoff function, after a short while I am already getting an RMSE on the test set of about 17 kcal/mol. Compared to the same after running the subtract_nonbonded_cutoff I get an RMSE approaching 700,000 kcal/mol. This is similar to the ANI-1 dataset where I got an RMSE around 9 kcal/mol without the nonbonded forces subtracted.

It seems like something is off with the nonbonded cutoff and this data. Maybe I should examine some cases and see what is contributing to large discrepancies in the energies for different snapshots of the same molecule?

jeherr · 2022-05-02T17:25:40Z

I've started looking at single cases to see what is happening. I've put two snapshots of the same molecule which have drastically different energies after subtracting the non-bonded forces below.

Snapshot 1
N     -8.137341  -10.078748    0.727212
S     -9.272774  -10.206999    1.942847
O    -10.192329  -11.231987    1.530874
O     -8.544293  -10.341575    3.171461
C    -10.196381   -8.680454    2.002665
C    -11.572171   -8.767362    1.865210
C    -12.347156   -7.613406    1.927217
C    -13.833607   -7.608441    1.766267
O    -14.441751   -6.603974    1.447415
N    -14.453722   -8.793941    1.979009
C    -15.906223   -8.832888    1.954126
C    -16.430710   -8.911547    0.515027
S    -15.849838  -10.383352   -0.365994
C    -16.338855  -10.024023    2.786574
O    -15.581322  -10.756592    3.361766
O    -17.661348  -10.167334    2.817408
C     -3.663894   -1.413686    0.015590
S     -3.867026    1.282770   -1.983705
C     -4.212282    1.058678   -0.216476
O     -2.969383   -2.351723    0.619511
O     -4.538483   -1.626022   -0.778272
C    -11.717601   -6.380254    2.111839
C    -10.344483   -6.300705    2.254807
C     -9.568659   -7.456325    2.202649
Cl    -7.866182   -7.331511    2.359206
H     -8.603466   -9.957365   -0.159179
H     -7.490783   -9.336768    0.957578
H    -12.018846   -9.739729    1.695387
H    -14.000905   -9.569994    2.443220
H    -16.296905   -7.913486    2.415461
H    -16.136555   -8.000773   -0.009781
H    -17.518777   -8.974446    0.534714
H    -14.553864  -10.144856   -0.194792
H    -17.888951  -10.946867    3.351314
H    -12.325914   -5.482924    2.140825
H     -9.859410   -5.343641    2.403020

Snapshot 2
N      5.281235   -0.390297    1.545866
S      4.188246   -1.312862    0.688565
O      3.637047   -2.258482    1.616260
O      4.868465   -1.765956   -0.490135
C      2.851631   -0.243136    0.177258
C      1.573471   -0.564881    0.593512
C      0.487565    0.207089    0.178140
C     -0.859838   -0.189154    0.671272
O     -0.998203   -0.941722    1.633517
N     -1.914766    0.324512    0.015887
C     -3.281233    0.032451    0.427677
C     -3.663894   -1.413686    0.015590
S     -3.867026    1.282770   -1.983705
C     -4.212282    1.058678   -0.216476
O     -2.969383   -2.351723    0.619511
O     -4.538483   -1.626022   -0.778272
C      0.713964    1.312106   -0.643747
C      1.998994    1.636482   -1.044474
C      3.081762    0.859694   -0.639229
Cl     4.658862    1.303600   -1.135424
H      4.842639   -0.034739    2.381792
H      5.661249    0.323810    0.940430
H      1.408775   -1.421712    1.237476
H     -1.846612    0.753742   -0.903202
H     -4.062904    2.035406    0.244864
H     -4.289576    0.082539   -2.353096
H     -5.247340    0.749656   -0.066097
H     -2.238753   -1.984071    1.191253
H     -3.327854    0.094017    1.523736
H     -0.109725    1.942046   -0.960098
H      2.173969    2.499843   -1.674868

and so we can inspect which atoms are experiencing large non-bonded forces I'll past them below

Snapshot 1
N        1.08713393e-01, -1.80365020e+01, -1.06772583e+01
S       -1.55115469e+00, -1.75309192e+01, -1.81559129e+00
O        2.71087708e+01, -2.92941476e+01, -3.55273107e+00
O       -1.25924874e+00, -1.49534786e+01,  4.35116742e+00
C        1.72694523e+01, -2.54408385e+01, -1.16534965e+00
C       -7.31483839e-01, -1.17438189e+01, -2.97380732e+00
C       -2.50730084e+01, -6.53081686e-01, -2.52240816e+00
C        7.39724133e+00,  2.42169036e+00,  1.33375255e+00
O        4.66518456e+00,  3.16932680e+01, -1.27932012e+01
N       -5.26335684e+00,  3.69518830e+01, -5.93858951e+00
C       -1.67287037e+01, -2.43316501e+01,  5.54309663e+00
C       -3.10616713e+00, -4.18957633e-01, -1.75819347e+01
S       -3.53078258e+00, -8.42044498e+00, -2.37673806e+01
C       -1.74536384e+00, -7.57452027e-01,  1.17303258e+01
O       -1.67907545e+01, -2.98590940e+01,  2.30145397e+01
O       -1.06414537e+01, -1.29624697e+01,  2.26671074e+01
C       -6.05332432e-01,  2.74454141e+01,  5.10051942e+00
C        1.24801373e+01,  2.47739206e+01,  3.94291617e+00
C        2.35665815e+01,  6.49913025e+00,  2.96059208e+00
Cl        1.59499040e+01,  4.15579228e+01,  9.18904995e+00
H        3.10060769e-02, -6.27078584e-03, -2.09195364e-02
H        2.59355243e-01, -1.19402902e+00, -8.54459856e-01
H       -1.39510127e+01,  1.13163442e+01, -4.88075598e-02
H        1.07663862e+00,  7.52303962e-01, -5.35402960e-01
H       -1.26884913e+01, -3.94495217e+00,  6.50557985e+00
H       -6.31769561e+00, -2.00113592e+00, -6.61315273e+00
H       -9.55284990e-01,  4.45933319e+00, -9.60420233e+00
H        1.30681891e-02, -2.53528393e-01, -4.00126939e-01
H        3.25564995e-02, -4.66902995e-04, -1.18590989e-03
H        1.30568649e+01,  1.04302218e+01,  4.31284728e+00
H       -2.07618002e+00,  3.50181041e+00,  2.15014264e-01

Snapshot 2
N        5.46038266e+00, -9.63508576e+00,  1.55716531e+01
S        2.81964001e+00, -1.49633587e+01,  9.52096592e+00
O        2.85787345e+01, -1.53234853e+01,  8.20003093e+00
O        3.44476519e+00, -1.80781107e+01,  3.17356761e+00
C        2.31749002e+01, -1.64301635e+01,  8.55370902e+00
C        1.91586883e+00, -1.22695309e+01,  5.64867443e+00
C       -2.32916278e+01, -5.68863209e+00,  7.47058317e+00
C        1.10379330e+01,  1.28215818e+01,  2.48472211e+00
O        7.21082671e+01,  5.07010108e+01,  5.68877615e+01
N       -2.93001730e-01, -1.28763215e+01,  2.76115499e+01
C       -2.07217226e+01,  8.77622021e+00, -1.04717525e+01
C        8.73345216e+05,  5.22319822e+05,  8.00164449e+05
S        3.01915827e+03,  1.98327480e+03, -1.56077861e+04
C        3.63654868e+05,  1.07514702e+05, -3.77175989e+04
O        1.64081739e+05, -2.78452058e+05,  1.45369037e+05
O       -1.06809378e+06, -2.59358883e+05, -9.69476410e+05
C       -5.58807722e+00,  2.32571184e+01, -1.32056476e+01
C        4.50758663e+00,  2.28002519e+01, -1.69483804e+01
C        2.02299488e+01,  9.93055139e+00, -9.58394346e+00
Cl        2.93157973e+00,  3.71750775e+01, -2.75151677e+01
H        3.09415183e-02,  1.02943739e-02,  1.46294433e-02
H        5.41352282e-01, -4.81166800e-01,  1.06094007e+00
H       -6.09970801e+00,  1.61026559e+00, -3.13332858e+00
H        2.73499004e+00, -9.78060380e-01,  1.39562532e+00
H        2.13423148e+01,  3.96127332e+01,  4.42657342e+01
H        2.44755828e+00,  1.43240134e+01, -3.63964394e+01
H       -3.66718869e+05, -1.09496850e+05,  5.32708185e+04
H        3.05593533e+04,  1.53638623e+04,  2.39299946e+04
H        2.97125216e-02, -2.55892519e-02, -1.31012520e-02
H        7.96437073e+00,  9.40390755e+00, -5.71203800e+00
H       -2.99323150e+00,  2.45890254e+00, -1.38797877e+00

and just to make it easier to discuss here I'll attach a screenshot of each molecule

Snapshot 1

Snapshot 2

The non-bonded energy for the two snapshots are 102.526 kcal/mol and 228050.667 kcal/mol respectively. Obviously something is wrong here! The large forces are located on atoms towards the end of the tail coming off of the ring, primarily the two terminating carbons, the sulfur, the oxygens, and a few of the hydrogen. It appears rotation around the carbonyl is causing the large change in structure which leads to the drastic change in non-bonded energies.

jeherr · 2022-05-02T17:36:35Z

Just to add, there is nothing obvious to me why the energy of the second snapshot should be drastically higher than the first snapshot at the QM level, so my intuition is that something is not well described by the force field.

scripts to prepare denali dataset for training

1be8860

jchodera reviewed Apr 29, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scripts to prepare denali dataset for training #118

scripts to prepare denali dataset for training #118

jeherr commented Apr 29, 2022

lgtm-com bot commented Apr 29, 2022

jchodera commented Apr 29, 2022 •

edited

Loading

jchodera Apr 29, 2022

jeherr commented Apr 29, 2022

codecov-commenter commented Apr 30, 2022

jeherr commented Apr 30, 2022

jeherr commented May 2, 2022

jeherr commented May 2, 2022

scripts to prepare denali dataset for training #118

Are you sure you want to change the base?

scripts to prepare denali dataset for training #118

Conversation

jeherr commented Apr 29, 2022

lgtm-com bot commented Apr 29, 2022

jchodera commented Apr 29, 2022 • edited Loading

jchodera Apr 29, 2022

Choose a reason for hiding this comment

jeherr commented Apr 29, 2022

codecov-commenter commented Apr 30, 2022

Codecov Report

jeherr commented Apr 30, 2022

jeherr commented May 2, 2022

jeherr commented May 2, 2022

jchodera commented Apr 29, 2022 •

edited

Loading