feat: dq_correct_wrong_lang__for_tags #9581

benbenben2 · 2023-12-21T16:59:12Z

What

Python code to:

fetch unknown tags for categories, allergens, etc.
check if each unknown tag exists in the taxonomy in different language
if yes, to all products having the tag in wrong language:
- add existing tag in different language
- remove tag in wrong language

Code is generalized to - not only categories, but also - additional tags fields.

Lots of comments in the code, which should be helpful for reviewer and user.

Instructions on how to run the code, test it, run it in dev vs prod, are provided on top of the file.

The code was run successfully - in net environment - for 4 different tags:

"en:teilentrahmte-milch" => "de:teilentrahmte-milch"
"en:trempettes" => "fr:trempettes"
"en:turrones-de-chocolate" => "es:turrones-de-chocolate"
"en:кисели-краставички" => "bg:кисели-краставички"

Corresponding to the following products that have been updated:

4316268628273,categories;en:teilentrahmte-milch;de:teilentrahmte-milch
4061462842764,categories;en:teilentrahmte-milch;de:teilentrahmte-milch
4316268153898,categories;en:teilentrahmte-milch;de:teilentrahmte-milch
4300175269025,categories;en:teilentrahmte-milch;de:teilentrahmte-milch
3596710467815,categories;en:trempettes;fr:trempettes
8410086391138,categories;en:trempettes;fr:trempettes
5410921014546,categories;en:trempettes;fr:trempettes
0234590005996,categories;en:trempettes;fr:trempettes
5295000700387,categories;en:trempettes;fr:trempettes
8480017232182,categories;en:turrones-de-chocolate;es:turrones-de-chocolate
8410223704418,categories;en:turrones-de-chocolate;es:turrones-de-chocolate
8410886113527,categories;en:turrones-de-chocolate;es:turrones-de-chocolate
8480017229793,categories;en:turrones-de-chocolate;es:turrones-de-chocolate
8480000362537,categories;en:turrones-de-chocolate;es:turrones-de-chocolate
3800203290881,categories;en:кисели-краставички;bg:кисели-краставички
3800229785491,categories;en:кисели-краставички;bg:кисели-краставички
3800500012469,categories;en:кисели-краставички;bg:кисели-краставички
3800214420604,categories;en:кисели-краставички;bg:кисели-краставички
3800128903026,categories;en:кисели-краставички;bg:кисели-краставички
5290113000089,categories;en:кисели-краставички;bg:кисели-краставички
3800203741154,categories;en:кисели-краставички;bg:кисели-краставички
3800234140018,categories;en:кисели-краставички;bg:кисели-краставички

Related issue(s) and discussion

Fixes Create a script to correct all taxonomized values for categories that are stored in an incorrect language by storing them in the right language #7838

sonarqubecloud · 2023-12-21T17:03:20Z

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

16 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

codecov-commenter · 2023-12-21T17:30:56Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (44368ea) 49.10% compared to head (6a04a8e) 49.24%.
Report is 39 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9581      +/-   ##
==========================================
+ Coverage   49.10%   49.24%   +0.14%     
==========================================
  Files          66       66              
  Lines       20451    20532      +81     
  Branches     4910     4944      +34     
==========================================
+ Hits        10042    10112      +70     
  Misses       9132     9132              
- Partials     1277     1288      +11

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

alexgarel

@benbenben2 that's a really great job !

I have no blocker, so I approve.

But one think I would advise to do is to be able to be able to recover in case of incident:
That is:

after the search phase, store results in json
eventually skip search phase if I have json results and jump to update
use the already updated products information you log to skip already done updates.

alexgarel · 2024-02-05T13:11:17Z

scripts/update_tags_per_languages.py

+dev = True
+if dev:
+    env = "net"
+    user = "off:off@" 
+else:
+    env = "org"
+    user = "" 
+
+headers = {
+    'Accept': 'application/json', 
+    'User-Agent': 'UpdateTagsLanguages',
+}
+data = {
+    'user_id': '',
+    'password': '',
+}


We could use argparse for this

scripts/update_tags_per_languages.py

alexgarel · 2024-02-05T15:04:01Z

scripts/update_tags_per_languages.py

+        tags_list_url = f"https://{user}world.openfoodfacts.{env}/{plural}.json"
+
+        # by default the query return 24 results. 
+        # increase to 1000 (so far chips in EN (should be crisps in EN)
+        #   had max number of products for categories with ~550)
+        products_list_for_tag_url = f"https://{user}world.openfoodfacts.{env}/{singular}/{{tag_id_placeholder}}.json?page_size=1000"
+
+        file = os.path.abspath(os.path.join(os.path.dirname( __file__ ), '..', f'taxonomies/{plural}.txt')) 
+
+        # country is needed otherwise <plural>_lc will be "en"
+        post_call_url = f"https://{user}{{country}}.openfoodfacts.{env}/cgi/product_jqm2.pl"


(not to change) Just in case you care typically this kind of globals would better fits being attributes of a class, where you then call a method to process one tag.

class ProcessTag:

def init(self, plural, singular):
self.tags_list_url = …
...
def process(self):
all_tags = self.get_from_api()
etc.

alexgarel · 2024-02-05T15:05:45Z

scripts/update_tags_per_languages.py

+            # should retrieve all "en:blablabla, tag_name" or "it:tag_name"
+            tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]')
+
+            with open(file, "r") as f:


I would have used the json of taxonomy (like https://static.openfoodfacts.org/data/taxonomies/categories.json) and loaded it in memory. This way you don't have to parse anything but more iterate on a map.

But I would do it once and for all outside the for loop.

It does not show the synonyms in the json.
I would keep the taxonomy file locally instead.

alexgarel · 2024-02-05T15:07:00Z

scripts/update_tags_per_languages.py

+            # should retrieve all "en:blablabla, tag_name" or "it:tag_name"
+            tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]')


I'm not sure why we include a comma in it ? Why don't we stop on commas ?
Is this to ensure, tag_name appear in a synonym list ?

f'\n([a-z][a-z]:(?:[\w\s-'],-){tag_name})[,|\n]'

There are 2 commas in the regex.

if you have:

en:Hamburger buns, Burger buns, Rolls for hamburger, Buns for hamburger

and you are searching for "Rolls for hamburger"
the first comma in the regex would ignore the comma at the end of

en:Hamburger buns, Burger buns,

the second comma in the regex would ignore the comma right after the occurence:

, Buns for hamburger

Ok, I see, I'll try to comment, feel free to change:

Suggested change

# should retrieve all "en:blablabla, tag_name" or "it:tag_name"

tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]')

# should retrieve all "en:blablabla, tag_name" or "it:tag_name"

# the prefix is either the language or a comma.

# Suffix is either an end of line or a comma

tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]')

Do you know you can also use positive lookahead or lookbehing ((?=...) and (?<=...))

Or better just capture with a name the part you are interested in:

tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*(?P<tag_name>{tag_name})[,|\n]')

Then you can use the match.group("tag_name") to get your tag_name.

interesting

Actual regex captures the whole block: "([a-z][a-z]:(?:[\w\s\-\']*\,-)*({tag_name})"
It can be "en:blablabla, tag_name" or "it:tag_name", for example.

In the next lines, it checks if it starts with "en:" or another language, and if not it keeps the first element of the list of synonyms (for example, if we looked for "Alga spirulina" and it found "it:Spirulina, Alga spirulina", then it will keep "it:Spirulina")

Eventually, we could put "(?P[a-z][a-z]:(?:[\w\s-'],-)(?P<tag_name>{tag_name})", but maybe it is good as it is already now.

alexgarel · 2024-02-05T15:17:55Z

scripts/update_tags_per_languages.py

+            if dev:
+                i = 0
+            for product in all_products_for_tag["products"]:
+                if dev:
+                    if i > 0:
+                        break
+                    i += 1


It's ok to always count, and it's easier to read :-)

Suggested change

if dev:

i = 0

for product in all_products_for_tag["products"]:

if dev:

if i > 0:

break

i += 1

for i, product in enumerate(all_products_for_tag["products"]):

if dev and i > 0:

break

alexgarel · 2024-02-05T15:20:16Z

scripts/update_tags_per_languages.py

+    # only save possible new tags to be reviewed and added
+    with open("update_tags_per_languages_possible_new_tags_" + tag_type, "w") as f:
+        f.write("possible_new_tags")
+    with open("update_tags_per_languages_possible_new_tags_" + tag_type, "a") as f:
+        for possible_new_tag in possible_new_tags:
+            f.write("\n" + possible_new_tag) 


The search can be long, so I would save all the data you found out in a json file (for each map).

This way you could add an option to only load those json file and process them (if they exists), avoiding rerunning the search phase in case the update phase crashes.

alexgarel · 2024-02-05T15:36:30Z

scripts/update_tags_per_languages.py

+                with open(logs_file, "a") as f:
+                    f.write(f"{product['code']},{plural};{current_tag};{updated_tag}\n") 


nit: you can also keep the file open, and use f.flush() to ensure you writes are immediately written on disk.

I find it great that you write what is processed. You could check for this file to avoid re-processing things already done in case the script failed at some point.

alexgarel · 2024-02-05T15:37:51Z

scripts/update_tags_per_languages_tests.py

+            ],
+        }
+        file = os.path.abspath(os.path.join(os.path.dirname( __file__ ), '..', f'taxonomies/categories.txt')) 
+        possible_wrong_language_tags = unknown_tags_taxonomy_comparison(all_tags_dict, file, "categories")


great to have tested this function

alexgarel · 2024-02-05T15:39:39Z

scripts/update_tags_per_languages.py

+
+logs_file = "update_tags_per_languages_logs"
+if os.path.isfile(logs_file):
+    os.remove(logs_file)


maybe not if you want to continue an old failed script.

benbenben2 · 2024-02-17T16:26:50Z

@alexgarel

I applied your suggestions, script now:

can be run with input parameters
can be resumed if it crashes

The code was run for a single tag in .net environment. The file update_tags_per_languages_wrong_languages_detected_categories contained only:

current tag, :found tag
en:ядки,bg:ядки

The script was interrupted before the end of the update and run again. It resumed the updates.

Before

After

Note that it did not add new tag, because the tag was already listed in English:

<en:Nuts and their products
en:Nuts, Fruit nuts
bg:Ядки

alexgarel

Great job @benbenben2 !

alexgarel · 2024-03-04T13:55:07Z

scripts/update_tags_per_languages.py

        for possible_new_tag in possible_new_tags:
-            f.write("\n" + possible_new_tag) 
+            output_possible_new_tag_file.write("\n" + possible_new_tag) 


nit: More pythonic:

output_possible_new_tag_file.write("\n".join(possible_new_tags))

scripts/update_tags_per_languages.py

Co-authored-by: Alex Garel <alex@openfoodfacts.org>

sonarqubecloud · 2024-03-05T05:50:32Z

Quality Gate passed

Issues
21 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

Python code to: - fetch unknown tags for categories, allergens, etc. - check if each unknown tag exists in the taxonomy in different language - if yes, to all products having the tag in wrong language: - add existing tag in different language - remove tag in wrong language Code is generalized to - not only categories, but also - additional tags fields. - Part of #7838 --------- Co-authored-by: Alex Garel <alex@openfoodfacts.org>

dq_correct_wrong_lang__for_tags

6a04a8e

benbenben2 self-assigned this Dec 21, 2023

benbenben2 requested a review from a team as a code owner December 21, 2023 16:59

benbenben2 added the 🧽 Data quality https://wiki.openfoodfacts.org/Quality label Dec 21, 2023

alexgarel approved these changes Feb 5, 2024

View reviewed changes

applied suggestions

3a14edd

alexgarel approved these changes Mar 4, 2024

View reviewed changes

Apply suggestions from code review

5338afe

Co-authored-by: Alex Garel <alex@openfoodfacts.org>

alexgarel merged commit 0986951 into main Apr 2, 2024
12 checks passed

alexgarel deleted the dq_correct_wrong_lang__for_tags branch April 2, 2024 08:10

openfoodfacts-bot mentioned this pull request Apr 2, 2024

chore(main): release 2.30.0 #9917

Merged

benbenben2 mentioned this pull request Aug 18, 2024

feat: update script to correct wrong languages for tags #10711

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: dq_correct_wrong_lang__for_tags #9581

feat: dq_correct_wrong_lang__for_tags #9581

benbenben2 commented Dec 21, 2023

sonarqubecloud bot commented Dec 21, 2023

codecov-commenter commented Dec 21, 2023

alexgarel left a comment

alexgarel Feb 5, 2024

alexgarel Feb 5, 2024

alexgarel Feb 5, 2024

benbenben2 Feb 16, 2024

alexgarel Feb 5, 2024

benbenben2 Feb 11, 2024

alexgarel Feb 14, 2024 •

edited

Loading

benbenben2 Feb 16, 2024

alexgarel Feb 5, 2024

alexgarel Feb 5, 2024

alexgarel Feb 5, 2024

alexgarel Feb 5, 2024

alexgarel Feb 5, 2024

benbenben2 commented Feb 17, 2024

alexgarel left a comment

alexgarel Mar 4, 2024

sonarqubecloud bot commented Mar 5, 2024

		# should retrieve all "en:blablabla, tag_name" or "it:tag_name"
		tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']\,-){tag_name})[,\|\n]')

		with open(logs_file, "a") as f:
		f.write(f"{product['code']},{plural};{current_tag};{updated_tag}\n")

feat: dq_correct_wrong_lang__for_tags #9581

feat: dq_correct_wrong_lang__for_tags #9581

Conversation

benbenben2 commented Dec 21, 2023

What

Related issue(s) and discussion

sonarqubecloud bot commented Dec 21, 2023

Quality Gate passed

codecov-commenter commented Dec 21, 2023

Codecov Report

alexgarel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexgarel Feb 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

benbenben2 commented Feb 17, 2024

alexgarel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sonarqubecloud bot commented Mar 5, 2024

Quality Gate passed

alexgarel Feb 14, 2024 •

edited

Loading