Added data conversion scripts. #272

frobnitzem · 2024-08-02T00:07:09Z

These scripts outline the general format for working with all CSV file types.

I will add to this PR as I test these general scripts on multiple data sources.

frobnitzem · 2024-08-05T21:46:44Z

yaml_to_config.py needs checking to ensure that it's selecting the right variables for compatibility with other datasets used to train the foundational model, and also to check the indexing it uses for graph output properties.

Specifically, it now contains:

      "input_node_feature_names": [
        "atomC",
        "atomF",
        "atomH",
        "atomN",
        "atomO",
        "atomS",
        "atomHg",
        "atomCl",
        "atomicnumber",
        "IsAromatic",
        "HSP",
        "HSP2",
        "HSP3",
        "Hprop"
      ],

But likely we can't use all these. How should yaml_to_config be modified?
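One possible direction, as a minimal sketch only: have yaml_to_config.py keep the full name list but emit a selected subset together with the column index of each name, so downstream code can pick the matching columns. The function name `select_features`, the output keys `names` and `indices`, and the example subset are all hypothetical and not part of this PR. The sketch also assumes each feature occupies exactly one column, in the order listed above.

```python
from typing import Dict, List, Sequence

# Full list of node feature names, copied from the generated config above.
ALL_NODE_FEATURES: List[str] = [
    "atomC", "atomF", "atomH", "atomN", "atomO", "atomS", "atomHg", "atomCl",
    "atomicnumber", "IsAromatic", "HSP", "HSP2", "HSP3", "Hprop",
]


def select_features(all_names: Sequence[str], wanted: Sequence[str]) -> Dict[str, List]:
    """Return the wanted names with their column indices in all_names.

    Assumes one column per feature (hypothetical; multi-column features
    would need a width table instead of a plain position lookup).
    """
    position = {name: i for i, name in enumerate(all_names)}
    missing = [name for name in wanted if name not in position]
    if missing:
        # Fail early rather than silently dropping a misspelled feature.
        raise KeyError(f"unknown feature names: {missing}")
    return {"names": list(wanted), "indices": [position[n] for n in wanted]}


# Hypothetical subset, chosen only to illustrate the lookup.
selected = select_features(ALL_NODE_FEATURES, ["atomicnumber", "IsAromatic", "Hprop"])
print(selected)
```

Keeping the indices next to the names makes it easy to check by eye whether the graph output indexing matches what other datasets expect.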

* Added get_edge_attribute_name to smiles_utils
* Bugfix for returning 1-hot element names in smiles_utils/graph generation
* Made it possible to skip 1-hot element encoding in smiles_utils/graph generation
* Created TODO list in yaml_to_config.py

frobnitzem · 2024-08-14T19:23:21Z

Several steps need to be done before this is ready to merge:

- implement config.json reader according to data.md
  - check that these files are correctly generated by the present PR (utils/yaml_to_config.py)
  - update the output head configuration (see below)
  - test data loading from AdiosDataset with new feature selectors
- add classification loss head type to HydraGNN
  - use sigmoid as the activation function at the last layer (only needed if we end with an activation, since torch's logit-based cross-entropy losses take raw values in (-inf, inf) and apply the normalization internally)
  - use binary cross entropy as the loss function for training, validation, and testing
- update data loading (from AdiosDataset) so that metadata is used to target which features go into x and y (need to select only some columns when going from data file to in-memory format)
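To illustrate the sigmoid and binary cross entropy items above, here is a minimal pure-Python sketch (not HydraGNN code; all function names are hypothetical). It shows that applying a sigmoid and then binary cross entropy on probabilities gives the same value as a loss computed directly on raw logits. In torch, the corresponding pair would be `torch.nn.BCELoss` (expects probabilities) and `torch.nn.BCEWithLogitsLoss` (expects logits and applies the sigmoid itself), which is why the explicit last-layer sigmoid is only needed with the former.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_from_probs(probs: Sequence[float], targets: Sequence[float],
                   eps: float = 1e-12) -> float:
    """Mean binary cross entropy on probabilities in (0, 1)."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross entropy on raw logits in (-inf, inf).

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)),
    which equals -[y*log(s(z)) + (1-y)*log(1-s(z))] with s = sigmoid.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical logits and binary labels for three samples.
logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]

loss_a = bce_from_probs([sigmoid(z) for z in logits], targets)
loss_b = bce_with_logits(logits, targets)
print(loss_a, loss_b)  # the two values agree up to rounding
```

The logit form is usually preferred for training because it avoids taking the log of a saturated sigmoid.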

frobnitzem added 2 commits August 1, 2024 20:02

Added data conversion scripts.

dd9874d

Successful import_csv and yaml_to_config.

767bea6

frobnitzem added 3 commits August 7, 2024 14:52

Added documentation and dataset model validator.

00fd409

Updates for dataset ingestion.

afa9d59

Updated import_csv with additional input validation and pq read ability.

0a03032

kshitij-v-mehta mentioned this pull request Aug 14, 2024

Clintox dataset example #271

Closed

frobnitzem added 2 commits August 21, 2024 15:34

Updated train.py

5829eee

Added config.py parser.

8a8fd50
