update docs and add more tests

OpenBioML · Aug 13, 2024 · 84ccd84
1 parent 84ac753
commit 84ccd84
Showing 10 changed files with 333 additions and 445 deletions.
diff --git a/data/tabular/bicerano_dataset/meta.yaml b/data/tabular/bicerano_dataset/meta.yaml
@@ -54,12 +54,12 @@ bibtex:
     year = {2021},
     doi = {10.1021/acsapm.0c00524}}
 templates:
-  - The polymer with the {PSMILES__description} of {PSMILES#} has an experimental glass transition temperature of {Tg_exp#} K.
-  - The polymer with the {PSMILES__description} of {PSMILES#} has a computed glass transition temperature of {Tg_calc#} K.
-  - The polymer with the {PSMILES__description} of {PSMILES#} has a computed density at 300 K of {rho_300K_calc#} g/cc.
-  - The polymer with the {compound_name__names__noun} of {compound_name#} has an experimental glass transition temperature of {Tg_exp#} K.
-  - The polymer with the {compound_name__names__noun} of {compound_name#} has a computed glass transition temperature of {Tg_calc#} K.
-  - The polymer with the {compound_name__names__noun} of {compound_name#} has a computed density at 300 K of {rho_300K_calc#} g/cc.
+  - The polymer with the {PSMILES__description} of {PSMILES#} has an experimental glass transition temperature of {Tg_exp#} {Tg_exp__units}.
+  - The polymer with the {PSMILES__description} of {PSMILES#} has a computed glass transition temperature of {Tg_calc#} {Tg_calc__units}.
+  - The polymer with the {PSMILES__description} of {PSMILES#} has a computed density at 300 K of {rho_300K_calc#} {rho_300K_calc__units}.
+  - The polymer with the {compound_name__names__noun} of {compound_name#} has an experimental glass transition temperature of {Tg_exp#} {Tg_exp__units}.
+  - The polymer with the {compound_name__names__noun} of {compound_name#} has a computed glass transition temperature of {Tg_calc#} {Tg_calc__units}.
+  - The polymer with the {compound_name__names__noun} of {compound_name#} has a computed density at 300 K of {rho_300K_calc#} {rho_300K_calc__units}.
   - |-
-    Question: What is a polymer with a computed glass transition temperature of {Tg_calc#} K and a computed density at 300 K of {rho_300K_calc#} g/cc.
+    Question: What is a polymer with a computed glass transition temperature of {Tg_calc#} {Tg_calc__units} and a computed density at 300 K of {rho_300K_calc#} {rho_300K_calc__units}?
     Answer: A polymer with {PSMILES__description} {PSMILES#}
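The switch from hard-coded `K` and `g/cc` to `{..__units}` placeholders can be illustrated with a small sketch. This is not the chemnlp sampler: `fill`, `row`, and `units` are hypothetical names, the values are made up, and only the `{col#}` and `{col__units}` forms are handled.

```python
import re

# Hypothetical stand-ins for one data row and the units declared in meta.yaml.
row = {"PSMILES": "[*]CC[*]", "Tg_exp": 195.0}
units = {"Tg_exp": "K"}


def fill(template, row, units):
    """Replace {col#} with the row value and {col__units} with the column's units."""

    def substitute(match):
        key = match.group(1)
        if key.endswith("#"):
            return str(row[key[:-1]])
        if key.endswith("__units"):
            return units[key[: -len("__units")]]
        return match.group(0)  # leave any other placeholder untouched

    return re.sub(r"\{([^{}]+)\}", substitute, template)


template = (
    "The polymer {PSMILES#} has an experimental glass transition "
    "temperature of {Tg_exp#} {Tg_exp__units}."
)
print(fill(template, row, units))
```

With the unit taken from metadata, changing a column's unit in one place would propagate to every template that mentions that column.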
diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md
diff --git a/docs/EXPERIMENT.md b/docs/EXPERIMENT.md
diff --git a/docs/SUBMODULES.md b/docs/SUBMODULES.md
diff --git a/docs/api/meta_yaml_augmentor.md b/docs/api/meta_yaml_augmentor.md
@@ -0,0 +1,64 @@
+# Meta YAML Augmenter
+
+## Overview
+
+The Meta YAML Augmenter is a tool designed to enhance existing `meta.yaml` files for chemical datasets. It uses Large Language Models (LLMs) to generate additional templates and improve the metadata structure, particularly focusing on advanced sampling methods and template formats.
+
+## generate_augmented_meta_yaml
+
+::: chemnlp.data.meta_yaml_augmentor.generate_augmented_meta_yaml
+    handler: python
+    options:
+      show_root_heading: true
+      show_source: false
+
+## Command-Line Interface
+
+The module provides a command-line interface for easy augmentation of `meta.yaml` files.
+
+### Usage
+
+```bash
+python -m chemnlp.data.meta_yaml_augmentor <data_dir> [--model MODEL] [--override]
+```
+
+### Arguments
+
+- `data_dir` (str): Path to the directory containing the `meta.yaml` file to be augmented.
+- `--model` (str, optional): Name of the LLM to use for augmentation. Defaults to `gpt-4o`.
+- `--override` (flag): If set, the existing `meta.yaml` file will be overwritten with the augmented version.
+
+### Example
+
+```bash
+python -m chemnlp.data.meta_yaml_augmentor /path/to/dataset --model gpt-4o --override
+```
+
+## Augmentation Process
+
+The augmentation process involves:
+
+1. Reading the existing `meta.yaml` file from the specified directory.
+2. Sending the content to an LLM along with guidelines for creating advanced templates.
+3. Parsing the LLM's response to generate an augmented `meta.yaml` structure.
+4. Either printing the augmented structure or overwriting the existing file, depending on the `--override` flag.
+
+## Notes
+
+1. **LLM Integration**: This tool requires access to an LLM service, so make sure the necessary credentials are set up. By default, it uses `gpt-4o`, which requires the `OPENAI_API_KEY` environment variable to be set.
+
+2. **Output Quality**: The quality of the augmented `meta.yaml` depends on the capabilities of the LLM being used. Manual review and adjustment may be necessary.
+
+## Example Usage in Python
+
+```python
+import yaml
+
+from chemnlp.data.meta_yaml_augmentor import generate_augmented_meta_yaml
+
+data_dir = "/path/to/dataset"
+model_name = "gpt-4o"
+
+augmented_yaml = generate_augmented_meta_yaml(data_dir, model_name)
+
+if augmented_yaml:
+    print(yaml.dump(augmented_yaml))
+```
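The CLI arguments documented in the new docs page could be wired up with `argparse` roughly as follows. This is a minimal sketch based only on the documented arguments and default. `build_parser` is a hypothetical helper, and the actual module may define its interface differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented CLI (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Augment a meta.yaml file with LLM-generated templates."
    )
    parser.add_argument(
        "data_dir",
        type=str,
        help="Directory containing the meta.yaml file to augment.",
    )
    parser.add_argument(
        "--model",
        type=str,
        default="gpt-4o",
        help="Name of the LLM to use for augmentation.",
    )
    parser.add_argument(
        "--override",
        action="store_true",
        help="Overwrite the existing meta.yaml instead of printing the result.",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["/path/to/dataset", "--override"])
print(args.data_dir, args.model, args.override)
```

Using `store_true` means the file is left untouched unless `--override` is passed explicitly, which matches the documented behavior of printing by default.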
diff --git a/docs/index.md b/docs/index.md
@@ -0,0 +1,4 @@
+# ChemNLP
+
+ChemNLP is an effort to create the largest dataset of chemical data.
+We then use this dataset to train large language models (LLMs).
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -1,19 +1,20 @@
 site_name: ChemNLP Documentation
 theme:
   name: material
-  palette:
-    primary: teal
 nav:
   - Home: index.md
   - User Guide:
       - Installation: user-guide/installation.md
       - Quick Start: user-guide/quickstart.md
   - API Reference:
       - Sampler Module: api/sampler.md
+      - Sampler CLI: api/sampler_cli.md
+      - Meta YAML Generator: api/meta_yaml_generator.md
+      - Meta YAML Augmentor: api/meta_yaml_augmentor.md
   - Examples:
       - Basic Usage: examples/basic-usage.md
       - Advanced Techniques: examples/advanced-techniques.md
-  - Contributing: contributing.md
+  - Contributing: CONTRIBUTING.md
   - Changelog: changelog.md
 markdown_extensions:
   - pymdownx.highlight

diff --git a/src/chemnlp/data/meta_yaml_augmentor.py b/src/chemnlp/data/meta_yaml_augmentor.py
@@ -23,7 +23,7 @@
 `Is the {SMILES__description} {SMILES#} a {CYP2D6_Substrate__names__noun}:<EOI>{CYP2D6_Substrate#no&yes}`
 
 3. Conditional Statements:
-- Use {COLUMN#not &NULL} for conditional text based on column values.
+- Use {COLUMN#not &NULL} for conditional text based on column values. Note that this only makes sense for columns that are boolean.
 
 4. Random Choices:
 - Use {#option1|option2|option3!} for random selection of text.

diff --git a/src/chemnlp/data/sampler.py b/src/chemnlp/data/sampler.py
@@ -61,7 +61,6 @@ def __init__(
 
     def _wrap_identifier(self, identifier: str, value: str) -> str:
         """Wrap the identifier value with tags if wrap_identifiers is enabled."""
-        print("wrap_identifier", identifier, value, self.wrap_identifiers)
 
         if not self.wrap_identifiers:
             return value
@@ -164,6 +163,7 @@ def _get_target_from_row(self, sample: pd.Series, var: str) -> str:
         elif ("#" in var) and ("&" in var):
             var, choices = var.split("#")
             choices = choices.split("&")
+            # Use the column value as an index into the list of choices.
             choice = choices[sample[var]]
             return "" if choice == "NULL" else choice
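The `#`/`&` branch above can be exercised in isolation to see why the conditional syntax suits boolean (0/1) columns. The standalone function below mirrors the logic in the hunk but is only a sketch: `resolve_choice` is a hypothetical name, a plain dict stands in for the `pd.Series` row, and the `int(...)` cast is an addition not present in the original.

```python
def resolve_choice(sample, var):
    """Resolve a 'COLUMN#choice0&choice1' placeholder against one row."""
    column, choices = var.split("#")
    options = choices.split("&")
    # The column value picks the option by position: 0/False -> first, 1/True -> second.
    choice = options[int(sample[column])]
    # The literal NULL stands for "emit nothing".
    return "" if choice == "NULL" else choice


print(repr(resolve_choice({"CYP2D6_Substrate": 0}, "CYP2D6_Substrate#not &NULL")))
print(repr(resolve_choice({"CYP2D6_Substrate": 1}, "CYP2D6_Substrate#not &NULL")))
print(resolve_choice({"CYP2D6_Substrate": True}, "CYP2D6_Substrate#no&yes"))
```

A column holding any value other than 0 or 1 would index past the two choices, which is consistent with the note that this syntax only makes sense for boolean columns.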