Releases: openfoodfacts/openfoodfacts-ai
ingredient-detection-v1
This dataset is used to train a multilingual ingredient list detection model. The goal is to automate the extraction of ingredient lists from OCR results. See this issue for a broader context about ingredient list extraction.
Dataset generation
Raw unannotated texts are OCR results obtained with Google Cloud Vision. The dataset only contains texts from images marked as ingredient images on Open Food Facts.
The dataset was generated using ChatGPT-3.5: we asked ChatGPT to extract ingredient lists using the following prompt:
Prompt:
Extract ingredient lists from the following texts. The ingredient list should start with the first ingredient and end with the last ingredient. It should not include allergy, label or origin information.
The output format must be a single JSON list containing one element per ingredient list. If there are ingredients in several languages, the output JSON list should contain as many elements as detected languages. Each element should have two fields:
- a "text" field containing the detected ingredient list. The text should be a substring of the original text, you must not alter the original text.
- a "lang" field containing the detected language of the ingredient list.
Don't output anything else than the expected JSON list.
System prompt:
You are ChatGPT, a large language model trained by OpenAI. Only generate responses in JSON format. The output JSON must be minified.
A first cleaning step was performed automatically: we removed responses with:
- invalid JSON
- JSON with missing fields
- JSON where the detected ingredient list is not a substring of the original text
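The three automatic checks above can be sketched as a single validation function. This is a hypothetical helper for illustration; the actual cleaning script may be structured differently.

```python
import json

def is_valid_response(response: str, original_text: str) -> bool:
    """Apply the three automatic cleaning checks to a raw ChatGPT response."""
    try:
        items = json.loads(response)
    except json.JSONDecodeError:
        return False  # invalid JSON
    if not isinstance(items, list):
        return False
    for item in items:
        # JSON with missing fields
        if not isinstance(item, dict) or "text" not in item or "lang" not in item:
            return False
        # the detected ingredient list must be a substring of the original text
        if item["text"] not in original_text:
            return False
    return True
```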
A first NER model was trained on this dataset. The model's prediction errors on this dataset were inspected, which allowed us to spot the different kinds of annotation errors made by ChatGPT. Then, using a semi-automatic approach, we manually corrected samples that were likely to exhibit the errors spotted during the inspection phase. For example, we noticed that the prefix "Ingredients:" was sometimes included in the ingredient text span. We looked for every sample where "Ingredients" (and its translations in other languages) was part of the ingredient text, and corrected these samples manually. This approach allowed us to focus on problematic samples, instead of having to check the full train set.
These detection rules were mostly implemented using regex. The cleaning script with all rules can be found here.
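One such rule, for the "Ingredients:" prefix example above, could look like the sketch below. The prefix pattern only covers a few languages and the helper names are illustrative; the real cleaning script contains many more rules.

```python
import re

# Illustrative prefix pattern covering a few languages; the real cleaning
# script handles many more translations.
INGREDIENT_PREFIX = re.compile(
    r"^(ingredients?|ingr[ée]dients?|zutaten)\s*:?\s*", re.IGNORECASE
)

def flag_prefix(span_text: str) -> bool:
    """Return True if the annotated span still starts with an ingredient prefix."""
    return INGREDIENT_PREFIX.match(span_text) is not None

def strip_prefix(span_text: str) -> str:
    """Candidate correction: drop the prefix, keep the rest of the span."""
    return INGREDIENT_PREFIX.sub("", span_text, count=1)
```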
Once the detected errors were fixed using this approach, a new dataset alpha version was released, and we trained the model on this new dataset.
The dataset was split between train (90%) and test (10%) sets. Train and test splits were kept consistent across alpha releases. Only the test set was fully reviewed and corrected manually.
We tokenized the text using the Hugging Face pre-tokenizer with the `[WhitespaceSplit(), Punctuation()]` sequence. The dataset generation script can be found here.
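The effect of this pre-tokenizer sequence (split on whitespace, then isolate each punctuation character) can be approximated in pure Python. This is a stdlib sketch that only mimics the output of the `tokenizers` library for simple inputs, not the actual implementation.

```python
import re

def pre_tokenize(text: str) -> list[str]:
    """Approximate the [WhitespaceSplit(), Punctuation()] pre-tokenizer:
    split on whitespace, then split each punctuation character off as
    its own token."""
    tokens = []
    for chunk in text.split():
        # a captured group keeps the punctuation characters in the output
        tokens.extend(t for t in re.split(r"([^\w\s])", chunk) if t)
    return tokens
```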
This dataset is exactly the same as `ingredient-detection-alpha-v6`, the version used during model training.
Annotation guidelines
Annotation guidelines were updated continuously during dataset refinement and model training, but here are the final guidelines:
- Ingredient lists in all languages must be annotated.
- The ingredient list should start with the first ingredient, without any ingredient prefix ("Ingredients:", "Zutaten", "Ingrédients:") or language prefix ("EN:", "FR - ", ...).
- Ingredient lists containing a single ingredient without any ingredient or language prefix should not be annotated: otherwise, it's very difficult to know whether the mention is the ingredient list or just a random mention of an ingredient on the packaging.
- We have a very restrictive approach on where the ingredient list ends: we don't include any extra information (allergen, origin, trace or organic mentions) at the end of the ingredient list. The only exception is when this information is in brackets after the ingredient. This rule is in place to make it easier for the detector to know what is an ingredient list and what is not. Additional information can be added afterward as a post-processing step.
Dataset schema
The dataset is made of 2 JSONL files:

- `ingredient_detection_dataset-v1_train.jsonl.gz`: train split, 5065 samples
- `ingredient_detection_dataset-v1_test.jsonl.gz`: test split, 556 samples
Each sample has the following fields:

- `text`: the original text obtained from the OCR result
- `marked_text`: the text with ingredient spans delimited by `<b>` and `</b>`
- `tokens`: tokens obtained with pre-tokenization
- `ner_tags`: tag ID associated with each token: 0 for `O`, 1 for `B-ING` and 2 for `I-ING` (BIO schema)
- `offsets`: a list containing character start and end offsets of ingredient spans
- `meta`: a dict containing additional metadata about the sample:
  - `barcode`: the barcode of the product the image belongs to
  - `image_id`: unique digit identifier of the image for the product
  - `url`: image URL from which the text was extracted
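Given this schema, the annotated spans can be recovered directly from `text` and `offsets`. A minimal reading sketch, assuming the file layout described above:

```python
import gzip
import json

def iter_samples(path: str):
    """Stream samples from one of the .jsonl.gz dataset files."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def ingredient_spans(sample: dict) -> list[str]:
    """Recover the annotated ingredient lists from the character offsets."""
    return [sample["text"][start:end] for start, end in sample["offsets"]]
```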
Logo Dataset (2022-01-21)
This dataset contains annotated logos detected by the universal logo detector model. This dataset can be used to evaluate the performance of models in metric learning settings.
The dataset contains 374 categories and 64489 images. All logos were manually annotated by contributors using Hunger Games.
Annotated logos were first extracted from the PostgreSQL DB using the query in `extract_logo.sql`. Then, logos were filtered and grouped into categories using the `create_logo_dataset.py` script.
As postprocessing, the `clean_logo_dataset.py` script was run to rename categories and delete inconsistent categories.
Finally, a brief manual inspection was done to remove inconsistent logos (especially for logos of the `label` type). `brand_Carrefour` logos were split manually between `brand_Carrefour` and `brand_Carrefour_text` (text-only logos).
As the `logo_dataset.tar.gz` file is above the 2 GB file size limit, the archive was split using the `split` command. To merge the splits, use `cat logo_dataset.tar.gz.* > logo_dataset.tar.gz`.
Finally, train, val and test splits were generated (with the `split_train_test.py` script) with the following ratios: 0.8 for train, 0.1 for val and 0.1 for test. The split procedure went as follows:
- If the category contains >= 50 logos, split its logos between train, val and test according to the split ratios.
- Otherwise, assign the whole category to a single split, weighted by the split ratios.
The generated splits are in `train.txt`, `val.txt` and `test.txt`.
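The split procedure above can be sketched as follows. This is a hypothetical re-implementation for illustration; the actual logic lives in `split_train_test.py`.

```python
import random

def split_logos(categories: dict[str, list[str]], ratios=(0.8, 0.1, 0.1), seed=42):
    """Split logos into train/val/test: categories with >= 50 logos are
    split internally according to the ratios; smaller categories are
    assigned whole to a single split, weighted by the ratios."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    names = list(splits)
    for category, logos in categories.items():
        if len(logos) >= 50:
            logos = logos[:]
            rng.shuffle(logos)
            n_train = int(len(logos) * ratios[0])
            n_val = int(len(logos) * ratios[1])
            splits["train"].extend(logos[:n_train])
            splits["val"].extend(logos[n_train:n_train + n_val])
            splits["test"].extend(logos[n_train + n_val:])
        else:
            # assign the whole category to one split
            target = rng.choices(names, weights=ratios, k=1)[0]
            splits[target].extend(logos)
    return splits
```

Keeping small categories whole avoids splitting a handful of logos across sets, which would leak near-duplicate logos between train and test.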
Category Classification Dataset (2021-09-15)
This dataset follows all of the guidelines used in the previous release.
This dataset was generated using the category_dataset script.
The full taxonomies are provided alongside the 3 ML datasets.
The original products dataset is omitted as it is above 2 GB; it is the JSONL dump as available at https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz on 2021-09-15.
Category classification dataset (2020-06-30)
Category classification dataset.
Unlike the previous category dataset release, only the multilingual dataset (`xx`) is released. Only the products that met the following requirements were kept:

- non-empty `categories_tags` field
- `product_name` is not empty
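The product filtering step above can be sketched as a predicate over the JSONL dump. The helper names are illustrative, not the actual generation script.

```python
import gzip
import json

def keep_product(product: dict) -> bool:
    """Keep only products with a non-empty categories_tags field
    and a non-empty product_name."""
    return bool(product.get("categories_tags")) and bool(product.get("product_name"))

def filter_dump(path: str):
    """Stream the products JSONL dump and yield only the kept products."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            product = json.loads(line)
            if keep_product(product):
                yield product
```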
In each dataset, the following fields can be found:

- `code`: product barcode
- `product_name`: the product name in the main language
- `categories_tags`
- `ingredient_tags`
- `known_ingredient_tags`: tags of ingredients found in the taxonomy
- `ingredients_text`: the ingredient text in the main language
- `lang`: main language of the product
- `images`: images of the product
- `nutriments`: nutritional values of the product
Compared to the previous release, two new fields are added: `images` and `nutriments`. This opens up the possibility of using product images or nutritional values as inputs to predict categories.
The ingredient and category taxonomies used during dataset generation are also provided.
Number of products per split:
- `category_xx.train.jsonl.gz`: 551,757
- `category_xx.val.jsonl.gz`: 68,969
- `category_xx.test.jsonl.gz`: 68,969
Category classification dataset (2019-09-16)
Category classification datasets.
One dataset per major language was built, along with a multilingual (`xx`) dataset. Only the products that met the following requirement were kept:

- non-empty `categories_tags` field
For a language-specific dataset, the following requirements must also be met:

- the `lang` field is set to the input language
- `product_name_{lang}` is not empty

For the multilingual dataset:

- `product_name` is not empty
In each dataset, the following fields can be found:

- `code`: product barcode
- `product_name`: the language-specific product name (or the value of `product_name` for the multilingual dataset)
- `categories_tags`
- `ingredient_tags`
- `known_ingredient_tags`: tags of ingredients found in the taxonomy
- `ingredients_text`: the language-specific ingredient text (or the value of `ingredients_text` for the multilingual dataset)
- `lang`
The ingredient and category taxonomies used during dataset generation are also provided.