This section shows a step-by-step instructions on how to generate the PH pairs, perform annotation, and generate post-process the annotations to generate train/val/test set.
We first generate the PH pairs. Make sure to replace the --cache_dir
to some temporary directory in your local filesystem.
python query_and_generate_ph_pairs.py \
--input_type=query_string \
--query_filepath=../../data/FooDB/foodb_queries.txt \
--allowed_ncbi_taxids_filepath=../../data/FoodAtlas/allowed_ncbi_taxids.tsv \
--cache_dir=/home/jasonyoun/Temp
Output files are as follows:
- ../../outputs/data_processing/query_results.txt
- ../../outputs/data_processing/ph_pairs_{timestamp}.txt
Using the PH pairs generated above, we randomly generate sample train/val/test set, which is ready for annotation.
python generate_pre_annotation.py \
--train_pre_annotation_filepath=../../outputs/data_processing/train_pool_pre_annotation.tsv
Output files are as follows:
- ../../outputs/data_processing/train_pool_pre_annotation.tsv
- ../../outputs/data_processing/val_pre_annotation.tsv
- ../../outputs/data_processing/test_pre_annotation.tsv
- ../../outputs/data_processing/to_predict.tsv
We used Label Studio deployed on Heroku to annnotate the PH pairs. Once finished with annotation, export the annotation files as a .tsv file with the name format specified below for each dataset.
# Input (train)
../../outputs/data_processing/train_pool_pre_annotation.tsv
# Output
../../outputs/data_processing/train_pool_post_annotation.tsv
# Input (val)
../../outputs/data_processing/val_pre_annotation.tsv
# Output
../../outputs/data_processing/val_post_annotation.tsv
# Input (test)
../../outputs/data_processing/test_pre_annotation.tsv
# Output
../../outputs/data_processing/test_post_annotation.tsv
We now need to post-process the annotation to generate a clean version of train/val/test set.
python post_process_annotation.py \
--train_post_annotation_filepath=../../outputs/data_processing/train_pool_post_annotation.tsv \
--train_filepath=../../outputs/data_processing/train_pool.tsv
Output files are as follows:
- ../../outputs/data_processing/train_pool.tsv
- ../../outputs/data_processing/val.tsv
- ../../outputs/data_processing/test.tsv
We need to do hyperparameter optimization for the deployment (final) entailment model. Run the following Python script to generate the necessary files.
python generate_folds.py \
--input_train_filepath=../../outputs/data_processing/train_pool.tsv \
--input_val_filepath=../../outputs/data_processing/val.tsv \
--input_test_filepath=../../outputs/data_processing/test.tsv \
--output_dir=../../outputs/data_processing/folds_for_prod_model
Following the above steps finished the data generation process (PH pairs and train/val/test set). In this work, we generated additional PH pairs.
We generated more queries using the food-chemical pairs extracted from each external DB as follows.
# Frida
python generate_food_chem_queries.py \
--input_filepath=../../data/Frida/frida.tsv \
--output_filepath=../../data/Frida/frida_queries.txt
# Phenol-Explorer
python generate_food_chem_queries.py \
--input_filepath=../../data/Phenol-Explorer/phenol_explorer.tsv \
--output_filepath=../../data/Phenol-Explorer/phenol_explorer_queries.txt
We then generated PH pairs using the queries generated above.
# Frida
python query_and_generate_ph_pairs.py \
--input_type=query_string \
--query_filepath=../../data/Frida/frida_queries.txt \
--allowed_ncbi_taxids_filepath=../../data/FoodAtlas/allowed_ncbi_taxids.tsv \
--cache_dir=/home/jasonyoun/Temp
# Phenol-Explorer
python query_and_generate_ph_pairs.py \
--input_type=query_string \
--query_filepath=../../data/Phenol-Explorer/phenol_explorer_queries.txt \
--allowed_ncbi_taxids_filepath=../../data/FoodAtlas/allowed_ncbi_taxids.tsv \
--cache_dir=/home/jasonyoun/Temp
The LitSense API is limited to 100 results for a given query. We collaborated with the LitSense team to internally generate bigger query results (maximum 50,000 results for a given query). We then used these results to generate the PH pairs.
python query_and_generate_ph_pairs.py \
--input_type=query_results \
--query_filepath=../../data/FoodAtlas/litsense_query/queries_output/*.json \
--allowed_ncbi_taxids_filepath=../../data/FoodAtlas/allowed_ncbi_taxids.tsv \
--cache_dir=/home/jasonyoun/Temp