Semi-Automated Construction of Food Composition Knowledge Base

A food composition knowledge base, which stores the essential phyto-, micro-, and macro-nutrients of foods, is useful for both research and industrial applications. Although many existing knowledge bases attempt to curate such information, they are often limited by time-consuming manual curation processes. Outside of the food science domain, natural language processing methods that utilize pre-trained language models have recently shown promising results for extracting knowledge from unstructured text. In this work, we propose a semi-automated framework for constructing a knowledge base of food composition from the scientific literature available online. To this end, we utilize a pre-trained BioBERT language model in an active learning setup that allows the optimal use of limited training data. Our work demonstrates how human-in-the-loop models are a step toward AI-assisted food systems that scale well to the ever-increasing big data.

Prerequisites

This code has been tested with

Python 3.8

To prevent dependency problems, please use either virtualenv...

# Create and activate Python virtualenv
python3 -m venv env
source ./env/bin/activate

# Deactivate Python virtualenv
deactivate

or conda...

# Create and activate Conda environment
conda create -n mvenv python=3.8
conda activate mvenv

# Deactivate Conda environment
conda deactivate

In your environment, install the required Python packages.

pip install -r requirements.txt

Running

1. Query LitSense and Generate PH pairs.

cd src/data_generation
python query_and_generate_ph_pairs.py

Generates the following output files.
- ../../outputs/data_generation/query_results.txt
- ../../outputs/data_generation/ph_pairs_{timestamp}.txt
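Because the PH pairs file name contains a timestamp, later steps may need to locate the most recent one. The following is a minimal sketch, not part of the repository: the helper name `latest_ph_pairs` is hypothetical, and it picks the newest file by modification time because the exact timestamp format in the file name is not documented here.

```python
from pathlib import Path
from typing import Optional


def latest_ph_pairs(output_dir: str) -> Optional[Path]:
    """Return the most recently modified ph_pairs_*.txt file, or None if there is none."""
    candidates = list(Path(output_dir).glob("ph_pairs_*.txt"))
    if not candidates:
        return None
    # Compare modification times rather than parsing the file name,
    # since the timestamp format is an unknown here.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

From src/data_generation, it could be called as `latest_ph_pairs("../../outputs/data_generation")`.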

2. Generate pre-annotation.

python generate_pre_annotation.py \
    --train_pre_annotation_filepath=../../outputs/data_generation/train_pool_pre_annotation.tsv

Generates the following output files.
- ../../outputs/data_generation/train_pool_pre_annotation.tsv
- ../../outputs/data_generation/val_pre_annotation.tsv
- ../../outputs/data_generation/test_pre_annotation.tsv
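Before annotating, it can help to look at the first few rows of a pre-annotation file. The sketch below is an illustration only: the helper name `preview_tsv` is hypothetical, and it makes no assumption about the column names or whether the file has a header row.

```python
import csv
from typing import List


def preview_tsv(path: str, n_rows: int = 5) -> List[List[str]]:
    """Return the first n_rows rows of a tab-separated file, header included if present."""
    rows: List[List[str]] = []
    with open(path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.reader(f, delimiter="\t")):
            if i >= n_rows:
                break
            rows.append(row)
    return rows
```

For example, `preview_tsv("../../outputs/data_generation/val_pre_annotation.tsv")` returns up to five rows as lists of strings.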

3. (Manual) Annotate the pre_annotation files generated above.

Save the annotated files as follows.
- ../../outputs/data_generation/train_pool_post_annotation.tsv
- ../../outputs/data_generation/val_post_annotation.tsv
- ../../outputs/data_generation/test_post_annotation.tsv
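Since this step is manual, a quick check that all three annotated files were saved under the expected names can catch mistakes before post-processing. This is a minimal sketch and not part of the repository; the helper name `missing_annotation_files` is hypothetical, and the file names are taken from the list above.

```python
from pathlib import Path
from typing import List

# File names expected after manual annotation, as listed above.
EXPECTED = [
    "train_pool_post_annotation.tsv",
    "val_post_annotation.tsv",
    "test_post_annotation.tsv",
]


def missing_annotation_files(output_dir: str) -> List[str]:
    """Return the expected post-annotation file names that are not present in output_dir."""
    base = Path(output_dir)
    return [name for name in EXPECTED if not (base / name).is_file()]
```

An empty list from `missing_annotation_files("../../outputs/data_generation")` means all three files are in place.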

4. Post-process the annotation.

python post_process_annotation.py \
    --train_post_annotation_filepath=../../outputs/data_generation/train_pool_post_annotation.tsv \
    --train_filepath=../../outputs/data_generation/train_pool.tsv

Generates the following output files.
- ../../outputs/data_generation/train_pool.tsv
- ../../outputs/data_generation/val.tsv
- ../../outputs/data_generation/test.tsv

5. Run the entailment model.

Run the SLURM shell scripts to start the active learning sessions with the entailment model. This can take from tens of hours to several days, depending on your GPUs. In both *run*.sh files, you need to configure:

- SLURM configuration, e.g., email, log paths, etc.
- PATH_OUTPUT, the path to store trained models, statistics, etc.

cd scripts
./1_run_stratified.sh
./2_run_uncertain.sh

After running the above, the model training and evaluation results can be found in PATH_OUTPUT. The visualization of the statistics can be found in the outputs/ directory at the repository root.
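The script name 2_run_uncertain.sh suggests an uncertainty-based sampling strategy for active learning, although this README does not describe it. The sketch below shows the general idea of uncertainty sampling only; it assumes a binary classifier that outputs a positive-class probability, and it is not the repository's implementation. The function name and the example probabilities are hypothetical.

```python
from typing import List


def select_most_uncertain(probs: List[float], k: int) -> List[int]:
    """Return indices of the k samples whose positive-class probability is closest to 0.5."""
    # A smaller distance from 0.5 means the binary classifier is less certain.
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


# Hypothetical probabilities for five unlabeled pool samples.
pool_probs = [0.97, 0.52, 0.10, 0.45, 0.80]
print(select_most_uncertain(pool_probs, k=2))  # → [1, 3]
```

In an active learning round, the selected indices would be the samples sent to a human annotator next, in contrast to a stratified strategy that samples without regard to model confidence.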

Authors

Jason Youn @ https://github.com/jasonyoun
Fangzhou Li @ https://github.com/fangzhouli

Contact

For any questions, please contact us at tagkopouloslab@ucdavis.edu.

Citation

@inproceedings{
youn2023semiautomated,
title={Semi-Automated Construction of Food Composition Knowledge Base},
author={Jason Youn and Fangzhou Li and Ilias Tagkopoulos},
booktitle={2nd AAAI Workshop on AI for Agriculture and Food Systems},
year={2023},
url={https://openreview.net/forum?id=4I7WLDmseD}
}

License

This project is licensed under the Apache-2.0 License. Please see the LICENSE file for details.

Acknowledgments

We would like to thank the members of the Tagkopoulos lab for their suggestions and Gabriel Simmons for the initial discussions.
This work was supported by...
- USDA-NIFA AI Institute for Next Generation Food Systems (AIFS), USDA-NIFA award number 2020-67021-32855
- NIEHS grant P42ES004699
