update docs and add more tests
kjappelbaum committed Aug 13, 2024
1 parent 84ac753 commit 84ccd84
Showing 10 changed files with 333 additions and 445 deletions.
14 changes: 7 additions & 7 deletions data/tabular/bicerano_dataset/meta.yaml
@@ -54,12 +54,12 @@ bibtex:
year = {2021},
doi = {10.1021/acsapm.0c00524}}
 templates:
-  - The polymer with the {PSMILES__description} of {PSMILES#} has an experimental glass transition temperature of {Tg_exp#} K.
-  - The polymer with the {PSMILES__description} of {PSMILES#} has a computed glass transition temperature of {Tg_calc#} K.
-  - The polymer with the {PSMILES__description} of {PSMILES#} has a computed density at 300 K of {rho_300K_calc#} g/cc.
-  - The polymer with the {compound_name__names__noun} of {compound_name#} has an experimental glass transition temperature of {Tg_exp#} K.
-  - The polymer with the {compound_name__names__noun} of {compound_name#} has a computed glass transition temperature of {Tg_calc#} K.
-  - The polymer with the {compound_name__names__noun} of {compound_name#} has a computed density at 300 K of {rho_300K_calc#} g/cc.
+  - The polymer with the {PSMILES__description} of {PSMILES#} has an experimental glass transition temperature of {Tg_exp#} {Tg_exp__units}.
+  - The polymer with the {PSMILES__description} of {PSMILES#} has a computed glass transition temperature of {Tg_calc#} {Tg_exp__units}.
+  - The polymer with the {PSMILES__description} of {PSMILES#} has a computed density at 300 K of {rho_300K_calc#} {rho_300K_calc__units}.
+  - The polymer with the {compound_name__names__noun} of {compound_name#} has an experimental glass transition temperature of {Tg_exp#} {Tg_exp__units}.
+  - The polymer with the {compound_name__names__noun} of {compound_name#} has a computed glass transition temperature of {Tg_calc#} {Tg_calc__units}.
+  - The polymer with the {compound_name__names__noun} of {compound_name#} has a computed density at 300 K of {rho_300K_calc#} {rho_300K_calc__units}.
   - |-
-    Question: What is a polymer with a computed glass transition temperature of {Tg_calc#} K and a computed density at 300 K of {rho_300K_calc#} g/cc.
+    Question: What is a polymer with a computed glass transition temperature of {Tg_calc#} {Tg_calc__units} and a computed density at 300 K of {rho_300K_calc#} {rho_300K_calc__units}.
     Answer: A polymer with {PSMILES__description} {PSMILES#}
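
The hunk above swaps hard-coded units (`K`, `g/cc`) for `{column__units}` placeholders. A minimal sketch of how such placeholders might be filled is shown below. The helper `fill_template`, the `row` and `units` dictionaries, and the density value are hypothetical and only illustrate the idea; the actual substitution is done by the sampler in `chemnlp` and may work differently.

```python
import re


def fill_template(template: str, row: dict, units: dict) -> str:
    """Fill {column#} with the row value and {column__units} with the unit string.

    Hypothetical helper for illustration only, not the chemnlp sampler.
    """

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key.endswith("#"):
            # {column#} -> value of that column in the current row
            return str(row[key[:-1]])
        if key.endswith("__units"):
            # {column__units} -> unit declared for that column
            return units[key[: -len("__units")]]
        # Leave any other placeholder untouched
        return match.group(0)

    return re.sub(r"\{([^{}]+)\}", substitute, template)


template = (
    "The polymer has a computed density at 300 K of "
    "{rho_300K_calc#} {rho_300K_calc__units}."
)
row = {"rho_300K_calc": 1.05}  # made-up value
units = {"rho_300K_calc": "g/cc"}  # unit taken from the old hard-coded template

filled = fill_template(template, row, units)
print(filled)
```

Keeping the unit in the metadata rather than in the template text means a unit change only has to be made in one place.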
353 changes: 70 additions & 283 deletions docs/CONTRIBUTING.md

Large diffs are not rendered by default.

77 changes: 0 additions & 77 deletions docs/EXPERIMENT.md

This file was deleted.

73 changes: 0 additions & 73 deletions docs/SUBMODULES.md

This file was deleted.

64 changes: 64 additions & 0 deletions docs/api/meta_yaml_augmentor.md
@@ -0,0 +1,64 @@
# Meta YAML Augmenter

## Overview

The Meta YAML Augmenter is a tool designed to enhance existing `meta.yaml` files for chemical datasets. It uses Large Language Models (LLMs) to generate additional templates and improve the metadata structure, particularly focusing on advanced sampling methods and template formats.

## generate_augmented_meta_yaml

::: chemnlp.data.meta_yaml_augmenter.generate_augmented_meta_yaml
    handler: python
    options:
      show_root_heading: true
      show_source: false

## CLI Interface

The module provides a command-line interface for easy augmentation of `meta.yaml` files.

### Usage

```bash
python -m chemnlp.data.meta_yaml_augmenter <data_dir> [--model MODEL] [--override]
```

### Arguments

- `data_dir` (str): Path to the directory containing the `meta.yaml` file to be augmented.
- `--model` (str, optional): The name of the LLM to use for augmentation. Defaults to `gpt-4o`.
- `--override` (flag): If set, the existing `meta.yaml` file will be overwritten with the augmented version.
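
An interface with these three arguments could be declared with `argparse` roughly as follows. This is a sketch based only on the argument list above: `build_parser` is a hypothetical name, and the module's actual CLI may be built with a different library.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(prog="meta_yaml_augmenter")
    parser.add_argument(
        "data_dir",
        type=str,
        help="Directory containing the meta.yaml file to augment.",
    )
    parser.add_argument(
        "--model",
        type=str,
        default="gpt-4o",
        help="Name of the LLM to use.",
    )
    parser.add_argument(
        "--override",
        action="store_true",
        help="Overwrite the existing meta.yaml with the augmented version.",
    )
    return parser


# Parse an example command line instead of sys.argv
args = build_parser().parse_args(["/path/to/dataset", "--override"])
print(args.data_dir, args.model, args.override)
```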

### Example

```bash
python -m chemnlp.data.meta_yaml_augmenter /path/to/dataset --model gpt-4o --override
```

## Augmentation Process

The augmentation process involves:

1. Reading the existing `meta.yaml` file from the specified directory.
2. Sending the content to an LLM along with guidelines for creating advanced templates.
3. Parsing the LLM's response to generate an augmented `meta.yaml` structure.
4. Either printing the augmented structure or overwriting the existing file, based on the `override` flag.
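
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: `augment_meta_yaml`, `GUIDELINES`, and `fake_llm` are hypothetical names, the LLM call is replaced by a stub, and the response is kept as plain text where the real tool would parse it as YAML.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Placeholder prompt; the real guidelines are much longer (assumption)
GUIDELINES = "Add advanced templates to this meta.yaml:"


def augment_meta_yaml(
    data_dir: str, ask_llm: Callable[[str], str], override: bool = False
) -> str:
    meta_path = Path(data_dir) / "meta.yaml"
    original = meta_path.read_text()  # step 1: read the existing file
    response = ask_llm(f"{GUIDELINES}\n\n{original}")  # step 2: query the LLM
    augmented = response.strip()  # step 3: stand-in for parsing the response
    if override:  # step 4: overwrite or print
        meta_path.write_text(augmented + "\n")
    else:
        print(augmented)
    return augmented


def fake_llm(prompt: str) -> str:
    """Stub model: echo the file content and append one template."""
    original = prompt.split("\n\n", 1)[1]
    return original.rstrip() + "\n  - Extra template for {SMILES#}\n"


with tempfile.TemporaryDirectory() as tmp:
    meta_file = Path(tmp) / "meta.yaml"
    meta_file.write_text("name: demo\ntemplates:\n  - Old template\n")
    result = augment_meta_yaml(tmp, fake_llm, override=True)
    written = meta_file.read_text()
```

Passing the model call in as a function keeps the sketch runnable without credentials and makes the flow easy to test.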

## Notes

1. **LLM Integration**: This tool requires access to an LLM service, so make sure the necessary credentials are set up. By default, it uses `gpt-4o`, which requires the `OPENAI_API_KEY` environment variable to be set.

2. **Output Quality**: The quality of the augmented `meta.yaml` depends on the capabilities of the LLM being used. Manual review and adjustment may be necessary.

## Example Usage in Python

```python
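import yaml

from chemnlp.data.meta_yaml_augmenter import generate_augmented_meta_yaml

data_dir = "/path/to/dataset"
model_name = "gpt-4o"

# Ask the LLM for an augmented version of <data_dir>/meta.yaml
augmented_yaml = generate_augmented_meta_yaml(data_dir, model_name)

# Only dump the result if something was returned
if augmented_yaml:
    print(yaml.dump(augmented_yaml))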
```
4 changes: 4 additions & 0 deletions docs/index.md
@@ -0,0 +1,4 @@
# ChemNLP

ChemNLP is an effort to create the largest dataset of chemical data.
We then use this dataset to train large language models (LLMs).
7 changes: 4 additions & 3 deletions mkdocs.yml
@@ -1,19 +1,20 @@
site_name: ChemNLP Documentation
theme:
  name: material
  palette:
    primary: teal
nav:
  - Home: index.md
  - User Guide:
      - Installation: user-guide/installation.md
      - Quick Start: user-guide/quickstart.md
  - API Reference:
      - Sampler Module: api/sampler.md
      - Sampler CLI: api/sampler_cli.md
      - Meta YAML Generator: api/meta_yaml_generator.md
      - Meta YAML Augmentor: api/meta_yaml_augmentor.md
  - Examples:
      - Basic Usage: examples/basic-usage.md
      - Advanced Techniques: examples/advanced-techniques.md
  - Contributing: contributing.md
  - Contributing: CONTRIBUTING.md
  - Changelog: changelog.md
markdown_extensions:
  - pymdownx.highlight
2 changes: 1 addition & 1 deletion src/chemnlp/data/meta_yaml_augmentor.py
@@ -23,7 +23,7 @@
    `Is the {SMILES__description} {SMILES#} a {CYP2D6_Substrate__names__noun}:<EOI>{CYP2D6_Substrate#no&yes}`
 3. Conditional Statements:
-   - Use {COLUMN#not &NULL} for conditional text based on column values.
+   - Use {COLUMN#not &NULL} for conditional text based on column values. Note that this only makes sense for columns that are boolean.
 4. Random Choices:
    - Use {#option1|option2|option3!} for random selection of text.
2 changes: 1 addition & 1 deletion src/chemnlp/data/sampler.py
@@ -61,7 +61,6 @@ def __init__(

     def _wrap_identifier(self, identifier: str, value: str) -> str:
         """Wrap the identifier value with tags if wrap_identifiers is enabled."""
-        print("wrap_identifier", identifier, value, self.wrap_identifiers)

         if not self.wrap_identifiers:
             return value
@@ -164,6 +163,7 @@ def _get_target_from_row(self, sample: pd.Series, var: str) -> str:
         elif ("#" in var) and ("&" in var):
             var, choices = var.split("#")
             choices = choices.split("&")
+            print("var and choices and sample", var, choices, sample)
             choice = choices[sample[var]]
             return "" if choice == "NULL" else choice
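
The branch above uses the column value as a list index into the `&`-separated choices, which is why the `{COLUMN#not &NULL}` syntax only makes sense for boolean (0/1) columns. The standalone sketch below reproduces that logic outside the class. `resolve_choice` and the `is_toxic` column are hypothetical, and the explicit `int(...)` cast is an addition for clarity that is not in the original code.

```python
def resolve_choice(var: str, row: dict) -> str:
    """Resolve a 'COLUMN#first&second' variable against a row of data."""
    column, choices = var.split("#")
    options = choices.split("&")
    # 0/False selects the first option, 1/True the second
    choice = options[int(row[column])]
    # The literal NULL stands for "emit nothing"
    return "" if choice == "NULL" else choice


yes_no = resolve_choice("CYP2D6_Substrate#no&yes", {"CYP2D6_Substrate": 1})
negated = resolve_choice("is_toxic#not &NULL", {"is_toxic": 0})
plain = resolve_choice("is_toxic#not &NULL", {"is_toxic": 1})

print(repr(yes_no), repr(negated), repr(plain))
```

A column holding any value other than 0 or 1 would select a nonexistent option and raise an `IndexError`.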
