Releases: openfoodfacts/openfoodfacts-ai

ingredient-detection-v1

16 Aug 09:48

This dataset is used to train a multilingual ingredient list detection model. The goal is to automate the extraction of ingredient lists from OCR results. See this issue for broader context on ingredient list extraction.

Dataset generation

Raw unannotated texts are OCR results obtained with Google Cloud Vision. The dataset only contains texts from images marked as ingredient images on Open Food Facts.
The dataset was generated using ChatGPT-3.5: we asked ChatGPT to extract ingredient lists using the following prompt:

Prompt:

Extract ingredient lists from the following texts. The ingredient list should start with the first ingredient and end with the last ingredient. It should not include allergy, label or origin information.
The output format must be a single JSON list containing one element per ingredient list. If there are ingredients in several languages, the output JSON list should contain as many elements as detected languages. Each element should have two fields:
- a "text" field containing the detected ingredient list. The text should be a substring of the original text, you must not alter the original text.
- a  "lang" field containing the detected language of the ingredient list.
Don't output anything else than the expected JSON list.

System prompt:

You are ChatGPT, a large language model trained by OpenAI. Only generate responses in JSON format. The output JSON must be minified.

A first cleaning step was performed automatically. We removed responses with:

  • invalid JSON
  • JSON with missing fields
  • JSON where the detected ingredient list is not a substring of the original text
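
The three checks above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual cleaning code: the function name `is_valid_response` is hypothetical, and the sketch assumes each response is the raw string returned by ChatGPT and that the two required fields are the "text" and "lang" fields named in the prompt.

```python
import json

# Fields required by the prompt for every detected ingredient list.
REQUIRED_FIELDS = ("text", "lang")


def is_valid_response(response: str, original_text: str) -> bool:
    """Return True if a response passes the three automatic checks (hypothetical helper)."""
    # Check 1: the response must be valid JSON, and a list as the prompt requires.
    try:
        items = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(items, list):
        return False
    for item in items:
        # Check 2: every element must carry all required fields.
        if not isinstance(item, dict) or any(f not in item for f in REQUIRED_FIELDS):
            return False
        # Check 3: the ingredient list must be an exact substring of the OCR text.
        if not isinstance(item["text"], str) or item["text"] not in original_text:
            return False
    return True
```

A response that rewrites or normalizes the OCR text (for example by fixing a typo) fails the substring check and is dropped.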

A first NER model was trained on this dataset. The model's prediction errors on this dataset were inspected, which allowed us to spot the different kinds of annotation errors made by ChatGPT. Then, using a semi-automatic approach, we manually corrected the samples that were likely to contain the errors spotted during the inspection phase. For example, we noticed that the prefix "Ingredients:" was sometimes included in the ingredient text span. We looked for every sample where "Ingredients" (or its translation in another language) was part of the ingredient text, and corrected these samples manually. This approach allowed us to focus on problematic samples, instead of having to check the full train set.

These detection rules were mostly implemented using regex. The cleaning script with all rules can be found here.
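
As an illustration of what such a rule can look like, here is a minimal sketch that flags spans containing an ingredient prefix. It is not taken from the cleaning script: the pattern, the short list of translations (the three prefixes quoted in the annotation guidelines) and the function name `has_ingredient_prefix` are assumptions made for the example.

```python
import re

# Illustrative pattern only: the real rules cover more languages and more error types.
PREFIX_RE = re.compile(r"\b(ingredients|ingrédients|zutaten)\b", re.IGNORECASE)


def has_ingredient_prefix(span_text: str) -> bool:
    """Return True if an annotated span contains an ingredient prefix (hypothetical rule)."""
    return PREFIX_RE.search(span_text) is not None


# Flagged spans are queued for manual review, not corrected automatically.
spans = ["Ingredients: sugar, salt", "sugar, salt", "Zutaten: Zucker, Salz"]
to_review = [span for span in spans if has_ingredient_prefix(span)]
```

The rule only narrows down which samples a human has to look at; the correction itself stays manual, as described above.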

Once the detected errors were fixed using this approach, a new alpha version of the dataset was released, and we trained the model on this new dataset.
The dataset was split between train (90%) and test (10%) sets. Train and test splits were kept consistent at each alpha release. Only the test dataset was fully reviewed and corrected manually.

We tokenized the text using the Hugging Face pre-tokenizer with the [WhitespaceSplit(), Punctuation()] sequence. The dataset generation script can be found here.
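
To make the effect of this pre-tokenization concrete, here is a standard-library approximation. It is a sketch, not the code used to build the dataset: the real pipeline relies on the Hugging Face tokenizers library, and this version assumes that a character counts as punctuation when it is ASCII punctuation or belongs to a Unicode "P" category, with each punctuation character isolated as its own token. The function names are hypothetical.

```python
import string
import unicodedata


def is_punctuation(char: str) -> bool:
    # Assumption: ASCII punctuation or any Unicode punctuation category (P*).
    return char in string.punctuation or unicodedata.category(char).startswith("P")


def pre_tokenize(text: str) -> list[tuple[str, tuple[int, int]]]:
    """Split on whitespace, then isolate punctuation; return (token, (start, end)) pairs."""
    tokens = []
    start = None  # start offset of the word currently being read, if any
    for i, char in enumerate(text):
        if char.isspace() or is_punctuation(char):
            if start is not None:  # close the pending word
                tokens.append((text[start:i], (start, i)))
                start = None
            if not char.isspace():  # punctuation becomes a one-character token
                tokens.append((char, (i, i + 1)))
        elif start is None:
            start = i
    if start is not None:  # flush the last word
        tokens.append((text[start:], (start, len(text))))
    return tokens
```

Keeping the character offsets alongside each token is what makes it possible to map annotated character spans to token-level tags.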

This dataset is exactly the same as ingredient-detection-alpha-v6, which was used during model training.

Annotation guidelines

Annotation guidelines were updated continuously during dataset refinement and model training, but here are the final guidelines:

  1. Ingredient lists in all languages must be annotated.
  2. An ingredient list should start with the first ingredient, without an ingredient prefix ("Ingredients:", "Zutaten", "Ingrédients: ") or a language prefix ("EN:", "FR - ",...).
  3. An ingredient list containing a single ingredient, without any ingredient or language prefix, should not be annotated. Otherwise, it's very difficult to know whether the mention is the ingredient list or just a random mention of an ingredient on the packaging.
  4. We have a very restrictive approach on where the ingredient list ends: we don't include any extra information (allergen, origin, trace, organic mentions) at the end of the ingredient list. The only exception is when this information is in brackets after the ingredient. This rule is in place to make it easier for the detector to know what is an ingredient list and what is not. Additional information can be added afterward as a post-processing step.

Dataset schema

The dataset is made of 2 JSONL files:

  • ingredient_detection_dataset-v1_train.jsonl.gz: train split, 5065 samples
  • ingredient_detection_dataset-v1_test.jsonl.gz: test split, 556 samples
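
A gzipped JSONL file holds one JSON object per line, so it can be read with the Python standard library alone. The sketch below is one possible way to do it; the helper name `read_jsonl_gz` is hypothetical and the file is assumed to be UTF-8 encoded.

```python
import gzip
import json
from typing import Iterator


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one sample (a dict) per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Example usage, assuming the file has been downloaded to the working directory:
# samples = list(read_jsonl_gz("ingredient_detection_dataset-v1_test.jsonl.gz"))
```

Reading lazily with a generator keeps memory use low, although both splits are small enough to load fully.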

Each sample has the following fields:

  • text: the original text obtained from OCR result
  • marked_text: the text with ingredient spans delimited by <b> and </b>
  • tokens: tokens obtained with pre-tokenization
  • ner_tags: tag ID associated with each token: 0 for O, 1 for B-ING and 2 for I-ING (BIO schema)
  • offsets: a list containing character start and end offsets of ingredient spans
  • meta: a dict containing additional metadata about the sample:
    • barcode: the product barcode of the image that was used
    • image_id: unique numeric identifier of the image for the product
    • url: image URL from which the text was extracted
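
The relationship between offsets and ner_tags can be sketched as follows. This is an illustration under stated assumptions, not the dataset generation script: the function name `spans_to_bio` is hypothetical, span end offsets are assumed to be exclusive, and token character offsets are assumed to be available from the pre-tokenization step.

```python
O, B_ING, I_ING = 0, 1, 2  # tag IDs as documented for the ner_tags field


def spans_to_bio(token_offsets, span_offsets):
    """Assign a BIO tag ID to each token from character-level ingredient spans."""
    tags = []
    previous_span = None  # index of the span holding the previous token, if any
    for tok_start, tok_end in token_offsets:
        current_span = None
        for index, (span_start, span_end) in enumerate(span_offsets):
            # Assumption: end offsets are exclusive and tokens lie fully inside spans.
            if span_start <= tok_start and tok_end <= span_end:
                current_span = index
                break
        if current_span is None:
            tags.append(O)
        elif current_span == previous_span:
            tags.append(I_ING)  # continues the span opened by an earlier token
        else:
            tags.append(B_ING)  # first token of a new span
        previous_span = current_span
    return tags


# "Ingredients: sugar, salt. Store cool" with one ingredient span covering "sugar, salt"
token_offsets = [(0, 11), (11, 12), (13, 18), (18, 19), (20, 24), (24, 25), (26, 31), (32, 36)]
tags = spans_to_bio(token_offsets, [(13, 24)])
```

In this example the prefix "Ingredients:" and the trailing sentence stay outside the span, in line with the annotation guidelines.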

Logo Dataset (2022-01-21)

21 Jan 15:54
969751e

This dataset contains annotated logos detected by the universal logo detector model. This dataset can be used to evaluate the performance of models in metric learning settings.

The dataset contains 374 categories and 64489 images. All logos were manually annotated by contributors using Hunger Games.

Annotated logos were first extracted from the PostgreSQL DB using the query in extract_logo.sql. Then, logos were filtered and grouped into categories using the create_logo_dataset.py script.
As a post-processing step, the clean_logo_dataset.py script was run to rename categories and delete inconsistent ones.
Finally, a (brief) manual inspection was done to remove inconsistent logos (especially for logos of the label type). brand_Carrefour logos were split manually between brand_Carrefour and brand_Carrefour_text (text-only logos).

Because the logo_dataset.tar.gz file exceeds the 2 GB file size limit, the archive was split using the split command.
To merge the splits, use:

cat logo_dataset.tar.gz.* > logo_dataset.tar.gz

Finally, train, val and test splits were generated (with the split_train_test.py script) with the following ratios: 0.8 for train, 0.1 for val and 0.1 for test. The split procedure went as follows:

  • If the category contains >= 50 logos, split the logos between train, val and test with respect to the split ratios.
  • Otherwise, assign the whole category to a single split with respect to the split ratios.
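
The procedure above can be sketched as follows. This is one possible reading, not the contents of split_train_test.py: the function name `split_logos`, the fixed seed, the rounding of split sizes, and the use of weighted random choice to assign small categories are all assumptions.

```python
import random

RATIOS = {"train": 0.8, "val": 0.1, "test": 0.1}
MIN_LOGOS_FOR_INNER_SPLIT = 50  # threshold stated in the release notes


def split_logos(categories: dict[str, list[str]], seed: int = 42) -> dict[str, list[str]]:
    """Split logos per category (large ones) or assign whole categories (small ones)."""
    rng = random.Random(seed)  # fixed seed so that the splits are reproducible
    names, weights = list(RATIOS), list(RATIOS.values())
    splits = {name: [] for name in names}
    for _, logos in sorted(categories.items()):
        if len(logos) >= MIN_LOGOS_FOR_INNER_SPLIT:
            # Large category: spread its logos across the three splits.
            shuffled = logos[:]
            rng.shuffle(shuffled)
            n_train = int(len(shuffled) * RATIOS["train"])
            n_val = int(len(shuffled) * RATIOS["val"])
            splits["train"] += shuffled[:n_train]
            splits["val"] += shuffled[n_train:n_train + n_val]
            splits["test"] += shuffled[n_train + n_val:]
        else:
            # Small category: the whole category goes to one split,
            # drawn with probabilities equal to the split ratios.
            chosen = rng.choices(names, weights=weights, k=1)[0]
            splits[chosen] += logos
    return splits
```

Under this reading, small categories in val and test are entirely absent from train, which matches the metric learning use case mentioned above, where a model is evaluated on categories it has not seen.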

The generated splits are in train.txt, val.txt, test.txt.

Category Classification Dataset (2021-09-15)

15 Sep 12:52
969751e

This dataset follows all of the guidelines used in the previous release.

This dataset was generated using the category_dataset script.

The full taxonomies are provided alongside the 3 ML datasets.

The original products dataset is omitted because it is larger than 2 GB. It is the JSONL dump that was available at https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz on 2021-09-15.

Category classification dataset (2020-06-30)

01 Jul 15:06

Category classification dataset.
Unlike the previous category dataset release, only the multi-lingual dataset (xx) is released. Only the products that met the following requirements were kept:

  • non-empty categories_tags field
  • product_name is not empty

In each dataset, the following fields can be found:

  • code: product barcode
  • product_name: the product name in the main language
  • categories_tags
  • ingredient_tags
  • known_ingredient_tags: tags of ingredients found in the taxonomy
  • ingredients_text: the ingredient text in the main language
  • lang: main language of the product
  • images: images of the product
  • nutriments: nutritional values of the product

Compared to the previous release, two new fields are added: images and nutriments. This opens the possibility of using the product images or nutritional values as input to predict the categories.

The ingredient and category taxonomies used during dataset generation are also provided.

Number of products per split:

  • category_xx.train.jsonl.gz: 551,757
  • category_xx.val.jsonl.gz: 68,969
  • category_xx.test.jsonl.gz: 68,969

Category classification dataset (2019-09-16)

16 Sep 16:12

Category classification datasets.
One dataset per major language was built along with a multilingual (xx) dataset. Only the products that met the following requirements were kept:

  • non-empty categories_tags field

For language-specific datasets, the following requirements must also be met:

  • lang field is set to the input language
  • product_name_{lang} is not empty

For the multilingual dataset:

  • product_name is not empty
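
These requirements amount to a simple predicate over product records. The sketch below is illustrative and not taken from the generation script: the function name `keep_product` is hypothetical, and each product is assumed to be a dict parsed from the JSONL dump.

```python
from typing import Optional


def keep_product(product: dict, lang: Optional[str] = None) -> bool:
    """Return True if a product qualifies for a dataset (lang=None means multilingual)."""
    # Common requirement: non-empty categories_tags field.
    if not product.get("categories_tags"):
        return False
    if lang is None:
        # Multilingual (xx) dataset: product_name must not be empty.
        return bool(product.get("product_name"))
    # Language-specific dataset: lang must match and product_name_{lang} must not be empty.
    return product.get("lang") == lang and bool(product.get(f"product_name_{lang}"))
```

For example, a French product with a filled product_name_fr but an empty categories_tags list is rejected from every dataset.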

In each dataset, the following fields can be found:

  • code: product barcode
  • product_name: the language-specific product name (or value of product_name for the multilingual dataset)
  • categories_tags
  • ingredient_tags
  • known_ingredient_tags: tags of ingredients found in the taxonomy
  • ingredients_text: the language-specific ingredient text (or value of ingredients_text for the multilingual dataset)
  • lang

The ingredient and category taxonomies used during dataset generation are also provided.