Added data conversion scripts. #272

Draft. Wants to merge 7 commits into base: main.

frobnitzem (Collaborator) commented Aug 2, 2024

These scripts outline the general format for working with all CSV file types.

I will add to this PR as I test these general scripts on multiple data sources.

frobnitzem (Collaborator, Author) commented

yaml_to_config.py needs to be checked to ensure that it selects the right variables for compatibility with the other datasets used to train the foundational model, and that the indexing it uses for graph output properties is correct.

Specifically, it now contains:

      "input_node_feature_names": [
        "atomC",
        "atomF",
        "atomH",
        "atomN",
        "atomO",
        "atomS",
        "atomHg",
        "atomCl",
        "atomicnumber",
        "IsAromatic",
        "HSP",
        "HSP2",
        "HSP3",
        "Hprop"
      ],

We likely can't use all of these, though. How should yaml_to_config.py be modified?
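
As a starting point for discussion, here is a minimal sketch of selecting a subset of the node features by name and turning that into column indices. The helper `select_feature_columns` and the particular subset kept here are hypothetical; nothing below is taken from yaml_to_config.py or HydraGNN.

```python
from typing import List, Sequence

# Feature names as currently emitted (copied from the config above).
ALL_FEATURES = [
    "atomC", "atomF", "atomH", "atomN", "atomO", "atomS", "atomHg", "atomCl",
    "atomicnumber", "IsAromatic", "HSP", "HSP2", "HSP3", "Hprop",
]


def select_feature_columns(all_names: Sequence[str], wanted: Sequence[str]) -> List[int]:
    """Return the column index of each name in `wanted`, in `wanted` order."""
    index = {name: i for i, name in enumerate(all_names)}
    missing = [name for name in wanted if name not in index]
    if missing:
        raise KeyError(f"unknown feature names: {missing}")
    return [index[name] for name in wanted]


# Hypothetical subset: drop the 1-hot element columns, keep everything else.
wanted = [n for n in ALL_FEATURES if n == "atomicnumber" or not n.startswith("atom")]
columns = select_feature_columns(ALL_FEATURES, wanted)

# Stand-in for one node's feature row; real rows would come from the dataset.
row = list(range(len(ALL_FEATURES)))
selected = [row[c] for c in columns]
```

Failing loudly on unknown names is deliberate: a misspelled feature should not silently shift the indexing that other datasets rely on.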

* Added get_edge_attribute_name to smiles_utils

* Bugfix for returning 1-hot element names in smiles_utils/graph
  generation

* Made it possible to skip 1-hot element encoding in smiles_utils/graph
  generation

* created TODO list in yaml_to_config.py
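
To illustrate what the optional 1-hot element encoding in these commits amounts to, here is a small standalone sketch. The function `encode_nodes` and its signature are assumptions made for illustration and do not reflect the actual smiles_utils API; only the `atom<Element>` and `atomicnumber` naming follows the config shown earlier.

```python
from typing import List, Optional, Sequence, Tuple

# Atomic numbers for the elements that appear in the config above.
ATOMIC_NUMBER = {"C": 6, "F": 9, "H": 1, "N": 7, "O": 8, "S": 16, "Hg": 80, "Cl": 17}


def encode_nodes(
    symbols: Sequence[str],
    one_hot_elements: Optional[Sequence[str]] = None,
) -> Tuple[List[str], List[List[float]]]:
    """Return (feature names, one feature row per atom).

    If `one_hot_elements` is None or empty, the 1-hot columns are skipped
    and only the atomic number is emitted.
    """
    elements = list(one_hot_elements or [])
    names = [f"atom{e}" for e in elements] + ["atomicnumber"]
    rows = []
    for symbol in symbols:
        row = [1.0 if symbol == e else 0.0 for e in elements]
        row.append(float(ATOMIC_NUMBER[symbol]))
        rows.append(row)
    return names, rows


# With 1-hot columns for a chosen element list.
names, rows = encode_nodes(["C", "O"], one_hot_elements=["C", "F", "H"])

# With the 1-hot encoding skipped.
names_skip, rows_skip = encode_nodes(["C", "O"])
```

Returning the names together with the rows keeps the column labels and the data in sync, which is the kind of mismatch the 1-hot name bugfix above is concerned with.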

frobnitzem commented Aug 14, 2024

Several steps need to be done before this is ready to merge:

  • implement config.json reader according to data.md
    • check that these files are correctly generated by the present PR (utils/yaml_to_config.py)
    • update the output head configuration (see below)
    • test data loading from AdiosDataset with new feature selectors
  • add classification loss head type to HydraGNN
    • use sigmoid as the activation function at the last layer (only needed if we end with an activation, since torch's logit-based cross-entropy losses accept raw outputs in (-inf, inf) directly)
    • use binary cross entropy as the loss function for training, validation, and testing
  • update data loading (from AdiosDataset) so that metadata is used to target which features go into x and y (need to select only some columns when going from data file to in-memory format)
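
On the classification head item above: the sketch below, in plain Python with no HydraGNN code, shows why the final sigmoid is optional. Binary cross entropy applied to `sigmoid(z)` equals a loss computed directly from the raw logit `z`, and the logit form stays finite for large `|z|`. In PyTorch the corresponding pair is `torch.nn.BCELoss` (expects probabilities) and `torch.nn.BCEWithLogitsLoss` (expects raw logits); which one the new head should use is still open, and the function names here are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_from_probability(p: float, y: float) -> float:
    """Binary cross entropy given a probability p in (0, 1)."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_from_logit(z: float, y: float) -> float:
    """Binary cross entropy given a raw logit z in (-inf, inf).

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# The two routes agree for moderate logits.
for z in (-3.0, -0.5, 0.0, 2.0):
    for y in (0.0, 1.0):
        assert abs(bce_from_probability(sigmoid(z), y) - bce_from_logit(z, y)) < 1e-9

# The logit form stays finite where sigmoid(z) would round to exactly 1.0.
large = bce_from_logit(1000.0, 0.0)
```

If the head ends in a sigmoid, the probability-based loss is the matching choice; if it emits raw values, the logit-based loss avoids both the extra activation and the `log(0)` failure mode.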
