AlkEthOH interaction-typing task #20

Draft
wants to merge 17 commits into
base: main

Conversation

maxentile
Member

@maxentile commented May 28, 2020

Add datasets for the tasks of classifying atom, bond, angle, and torsion types for molecules in the AlkEthOH rings set, and provide a simple rule-based baseline for each task.

Done:

  • Add a script to download the AlkEthOH rings dataset, label atom and interaction types using the OpenFF 1.0.0 Parsley force field, and save the discrete labels. (Tracked using Git LFS.)
  • Add PyTorch Dataset interfaces to AlkEthOH{Atom|Bond|Angle|Torsion}TypesDatasets

Todo:

  • Update paths from hfgp to espaloma
  • Update tests and make sure resources can be found (currently using relative paths; should use pkg_resources.resource_filename)
  • Discuss with @yuanqing-wang whether PyTorch Dataset interface is suitable, make adjustments
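The resource-lookup item above could be approached along these lines. This is only a sketch: it swaps in the standard library's `importlib.resources` (Python 3.9+) in place of `pkg_resources`, and the helper name `resource_path` is hypothetical, not something in the PR.

```python
from importlib import resources
from pathlib import Path


def resource_path(package: str, relative_path: str) -> Path:
    """Return the on-disk path of a file shipped inside an installed package.

    Assumption: the package is installed as ordinary files on disk
    (not inside a zip archive), so the resource has a real filesystem path.
    """
    # files() returns a traversable object rooted at the package directory,
    # so the result no longer depends on the current working directory.
    return Path(str(resources.files(package).joinpath(relative_path)))
```

With a helper like this, tests would locate data relative to the installed package (for instance under the `espaloma` package) instead of relative to wherever the test runner was started.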

@jchodera
Member

Add script to download AlkEthOH rings dataset

What's the rationale behind just using the rings? Limiting the number of compounds?

@maxentile
Member Author

What's the rationale behind just using the rings? Limiting the number of compounds?

No rationale, just a starting point -- the intention is still to also use the AlkEthOH chains set and the other sets listed in #2 (comment) and #2 (comment).

@maxentile
Member Author

Hmm, although the PyTorch views in https://github.com/choderalab/espaloma/blob/973d5e1de00b60390b93a054c4277db632569b04/espaloma/data/alkethoh/pytorch_datasets.py satisfy the PyTorch Dataset interface, they don't yet work with DataLoader.

For example,

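import torch
from espaloma.data.alkethoh.pytorch_datasets import AlkEthOHAtomTypesDataset  # import path inferred from the file linked above
dataset = AlkEthOHAtomTypesDataset()
loader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)
next(iter(loader))  # batches are collated on iteration, which is where the error is raised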

runs into this error:

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'openforcefield.topology.molecule.Molecule'>

Possible workarounds:

  • define a non-default collate function that handles OpenFF Molecules
  • replace OpenFF Molecule with something that default_collate knows what to do with, such as a dict or a dgl graph containing similar information
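The first workaround might look roughly like the sketch below. It assumes each dataset item is a `(molecule, labels)` pair, which the thread does not confirm, and the name `collate_molecules` is hypothetical. Labels are kept as a per-molecule list instead of being stacked, since molecules generally have different numbers of atoms or interactions.

```python
from typing import Any, List, Sequence, Tuple


def collate_molecules(batch: Sequence[Tuple[Any, Any]]) -> Tuple[List[Any], List[Any]]:
    """Collate (molecule, labels) pairs without touching the molecule objects.

    Assumption: every item in `batch` is a 2-tuple of (molecule, labels).
    """
    # Keep molecules in a plain Python list so no tensor conversion is attempted.
    molecules = [molecule for molecule, _ in batch]
    # Labels stay ragged (one entry per molecule) because sizes differ.
    labels = [label for _, label in batch]
    return molecules, labels
```

Such a function would be passed through the `collate_fn` argument, e.g. `torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, collate_fn=collate_molecules)`, so that `default_collate` is never called on an OpenFF Molecule.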

@jchodera
Member

jchodera commented Jun 1, 2020

cc @jaimergp @t-kimber on the dataset issue above.

  • AlkEthOHDataset knows how to load things from disk
  • AlkEthOHTypesDataset(AlkEthOHDataset) knows how to compute categorical loss
  • The AlkEthOH{Atom|Bond|Angle|Torsion}TypesDataset classes index into the appropriate types
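One way to read the three-level hierarchy above is sketched here in plain Python. The class names come from the thread, but everything inside them is an assumption: the record layout, the `type_key` attribute, the in-memory stand-in for loading from disk, and the choice of mean negative log-likelihood as the categorical loss. In the actual code these classes would presumably subclass `torch.utils.data.Dataset`.

```python
import math
from typing import Dict, List, Sequence, Tuple


class AlkEthOHDataset:
    """Base class: owns the records (in-memory stand-in for loading from disk)."""

    def __init__(self, records: Sequence[Dict]):
        # Assumption: each record has a "molecule" entry plus one list of
        # integer labels per interaction type.
        self.records = list(records)

    def __len__(self) -> int:
        return len(self.records)


class AlkEthOHTypesDataset(AlkEthOHDataset):
    """Middle layer: shared categorical loss and label lookup."""

    type_key: str = ""  # hypothetical attribute, set by each subclass

    def __getitem__(self, index: int) -> Tuple[str, List[int]]:
        record = self.records[index]
        return record["molecule"], record[self.type_key]

    @staticmethod
    def loss(predicted_probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
        """Mean negative log-likelihood of the true class."""
        total = sum(math.log(probs[label]) for probs, label in zip(predicted_probs, labels))
        return -total / len(labels)


class AlkEthOHAtomTypesDataset(AlkEthOHTypesDataset):
    type_key = "atom_types"  # leaf classes only select which labels to index


class AlkEthOHBondTypesDataset(AlkEthOHTypesDataset):
    type_key = "bond_types"
```

The angle and torsion variants would follow the same pattern, each differing only in which labels it selects.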
@maxentile mentioned this pull request Jun 17, 2020
2 participants