Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: dq_correct_wrong_lang__for_tags #9581

Merged
merged 3 commits into from
Apr 2, 2024

Conversation

benbenben2
Copy link
Collaborator

What

Python code to:

  • fetch unknown tags for categories, allergens, etc.
  • check if each unknown tag exists in the taxonomy in different language
  • if yes, to all products having the tag in wrong language:
    • add existing tag in different language
    • remove tag in wrong language

Code is generalized to - not only categories, but also - additional tags fields.

Lots of comments in the code, which should be helpful for reviewer and user.

Instructions on how to run the code, test it, run it in dev vs prod, are provided on top of the file.

The code was run successfully - in net environment - for 4 different tags:

"en:teilentrahmte-milch" => "de:teilentrahmte-milch"
"en:trempettes" => "fr:trempettes"
"en:turrones-de-chocolate" => "es:turrones-de-chocolate"
"en:кисели-краставички" => "bg:кисели-краставички"

Corresponding to the following products that have been updated:

4316268628273,categories;en:teilentrahmte-milch;de:teilentrahmte-milch
4061462842764,categories;en:teilentrahmte-milch;de:teilentrahmte-milch
4316268153898,categories;en:teilentrahmte-milch;de:teilentrahmte-milch
4300175269025,categories;en:teilentrahmte-milch;de:teilentrahmte-milch
3596710467815,categories;en:trempettes;fr:trempettes
8410086391138,categories;en:trempettes;fr:trempettes
5410921014546,categories;en:trempettes;fr:trempettes
0234590005996,categories;en:trempettes;fr:trempettes
5295000700387,categories;en:trempettes;fr:trempettes
8480017232182,categories;en:turrones-de-chocolate;es:turrones-de-chocolate
8410223704418,categories;en:turrones-de-chocolate;es:turrones-de-chocolate
8410886113527,categories;en:turrones-de-chocolate;es:turrones-de-chocolate
8480017229793,categories;en:turrones-de-chocolate;es:turrones-de-chocolate
8480000362537,categories;en:turrones-de-chocolate;es:turrones-de-chocolate
3800203290881,categories;en:кисели-краставички;bg:кисели-краставички
3800229785491,categories;en:кисели-краставички;bg:кисели-краставички
3800500012469,categories;en:кисели-краставички;bg:кисели-краставички
3800214420604,categories;en:кисели-краставички;bg:кисели-краставички
3800128903026,categories;en:кисели-краставички;bg:кисели-краставички
5290113000089,categories;en:кисели-краставички;bg:кисели-краставички
3800203741154,categories;en:кисели-краставички;bg:кисели-краставички
3800234140018,categories;en:кисели-краставички;bg:кисели-краставички

Related issue(s) and discussion

@benbenben2 benbenben2 self-assigned this Dec 21, 2023
@benbenben2 benbenben2 requested a review from a team as a code owner December 21, 2023 16:59
@benbenben2 benbenben2 added the 🧽 Data quality https://wiki.openfoodfacts.org/Quality label Dec 21, 2023
Copy link

Quality Gate Passed Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

16 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (44368ea) 49.10% compared to head (6a04a8e) 49.24%.
Report is 39 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9581      +/-   ##
==========================================
+ Coverage   49.10%   49.24%   +0.14%     
==========================================
  Files          66       66              
  Lines       20451    20532      +81     
  Branches     4910     4944      +34     
==========================================
+ Hits        10042    10112      +70     
  Misses       9132     9132              
- Partials     1277     1288      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@alexgarel alexgarel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benbenben2 that's a really great job !

I have no blocker, so I approve.

But one think I would advise to do is to be able to be able to recover in case of incident:
That is:

  • after the search phase, store results in json
  • eventually skip search phase if I have json results and jump to update
  • use the already updated products information you log to skip already done updates.

Comment on lines 62 to 77
dev = True
if dev:
env = "net"
user = "off:off@"
else:
env = "org"
user = ""

headers = {
'Accept': 'application/json',
'User-Agent': 'UpdateTagsLanguages',
}
data = {
'user_id': '',
'password': '',
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use argparse for this

scripts/update_tags_per_languages.py Show resolved Hide resolved
Comment on lines 287 to 297
tags_list_url = f"https://{user}world.openfoodfacts.{env}/{plural}.json"

# by default the query return 24 results.
# increase to 1000 (so far chips in EN (should be crisps in EN)
# had max number of products for categories with ~550)
products_list_for_tag_url = f"https://{user}world.openfoodfacts.{env}/{singular}/{{tag_id_placeholder}}.json?page_size=1000"

file = os.path.abspath(os.path.join(os.path.dirname( __file__ ), '..', f'taxonomies/{plural}.txt'))

# country is needed otherwise <plural>_lc will be "en"
post_call_url = f"https://{user}{{country}}.openfoodfacts.{env}/cgi/product_jqm2.pl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not to change) Just in case you care typically this kind of globals would better fits being attributes of a class, where you then call a method to process one tag.

class ProcessTag:

def init(self, plural, singular):
self.tags_list_url = …
...
def process(self):
all_tags = self.get_from_api()
etc.

# should retrieve all "en:blablabla, tag_name" or "it:tag_name"
tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]')

with open(file, "r") as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have used the json of taxonomy (like https://static.openfoodfacts.org/data/taxonomies/categories.json) and loaded it in memory. This way you don't have to parse anything but more iterate on a map.

But I would do it once and for all outside the for loop.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not show the synonyms in the json.
I would keep the taxonomy file locally instead.

Comment on lines 163 to 164
# should retrieve all "en:blablabla, tag_name" or "it:tag_name"
tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we include a comma in it ? Why don't we stop on commas ?
Is this to ensure, tag_name appear in a synonym list ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f'\n([a-z][a-z]:(?:[\w\s-'],-){tag_name})[,|\n]'

There are 2 commas in the regex.

if you have:

en:Hamburger buns, Burger buns, Rolls for hamburger, Buns for hamburger

and you are searching for "Rolls for hamburger"
the first comma in the regex would ignore the comma at the end of

en:Hamburger buns, Burger buns,

the second comma in the regex would ignore the comma right after the occurence:

, Buns for hamburger

Copy link
Member

@alexgarel alexgarel Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see, I'll try to comment, feel free to change:

Suggested change
# should retrieve all "en:blablabla, tag_name" or "it:tag_name"
tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]')
# should retrieve all "en:blablabla, tag_name" or "it:tag_name"
# the prefix is either the language or a comma.
# Suffix is either an end of line or a comma
tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]')

Do you know you can also use positive lookahead or lookbehing ((?=...) and (?<=...))

Or better just capture with a name the part you are interested in:

tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*(?P<tag_name>{tag_name})[,|\n]')

Then you can use the match.group("tag_name") to get your tag_name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting

Actual regex captures the whole block: "([a-z][a-z]:(?:[\w\s\-\']*\,-)*({tag_name})"
It can be "en:blablabla, tag_name" or "it:tag_name", for example.

In the next lines, it checks if it starts with "en:" or another language, and if not it keeps the first element of the list of synonyms (for example, if we looked for "Alga spirulina" and it found "it:Spirulina, Alga spirulina", then it will keep "it:Spirulina")

Eventually, we could put "(?P[a-z][a-z]:(?:[\w\s-'],-)(?P<tag_name>{tag_name})", but maybe it is good as it is already now.

Comment on lines 326 to 332
if dev:
i = 0
for product in all_products_for_tag["products"]:
if dev:
if i > 0:
break
i += 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok to always count, and it's easier to read :-)

Suggested change
if dev:
i = 0
for product in all_products_for_tag["products"]:
if dev:
if i > 0:
break
i += 1
for i, product in enumerate(all_products_for_tag["products"]):
if dev and i > 0:
break

Comment on lines 214 to 219
# only save possible new tags to be reviewed and added
with open("update_tags_per_languages_possible_new_tags_" + tag_type, "w") as f:
f.write("possible_new_tags")
with open("update_tags_per_languages_possible_new_tags_" + tag_type, "a") as f:
for possible_new_tag in possible_new_tags:
f.write("\n" + possible_new_tag)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The search can be long, so I would save all the data you found out in a json file (for each map).

This way you could add an option to only load those json file and process them (if they exists), avoiding rerunning the search phase in case the update phase crashes.

Comment on lines 360 to 361
with open(logs_file, "a") as f:
f.write(f"{product['code']},{plural};{current_tag};{updated_tag}\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can also keep the file open, and use f.flush() to ensure you writes are immediately written on disk.

I find it great that you write what is processed. You could check for this file to avoid re-processing things already done in case the script failed at some point.

],
}
file = os.path.abspath(os.path.join(os.path.dirname( __file__ ), '..', f'taxonomies/categories.txt'))
possible_wrong_language_tags = unknown_tags_taxonomy_comparison(all_tags_dict, file, "categories")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great to have tested this function


logs_file = "update_tags_per_languages_logs"
if os.path.isfile(logs_file):
os.remove(logs_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not if you want to continue an old failed script.

@benbenben2
Copy link
Collaborator Author

@alexgarel

I applied your suggestions, script now:

  • can be run with input parameters
  • can be resumed if it crashes

The code was run for a single tag in .net environment. The file update_tags_per_languages_wrong_languages_detected_categories contained only:

current tag, :found tag
en:ядки,bg:ядки

The script was interrupted before the end of the update and run again. It resumed the updates.

Before
Screenshot_20240217_145520

After
Screenshot_20240217_171819

Note that it did not add new tag, because the tag was already listed in English:

<en:Nuts and their products
en:Nuts, Fruit nuts
bg:Ядки

Copy link
Member

@alexgarel alexgarel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @benbenben2 !

Comment on lines 248 to 249
for possible_new_tag in possible_new_tags:
f.write("\n" + possible_new_tag)
output_possible_new_tag_file.write("\n" + possible_new_tag)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: More pythonic:

        output_possible_new_tag_file.write("\n".join(possible_new_tags)) 

scripts/update_tags_per_languages.py Outdated Show resolved Hide resolved
scripts/update_tags_per_languages.py Outdated Show resolved Hide resolved
scripts/update_tags_per_languages.py Outdated Show resolved Hide resolved
Co-authored-by: Alex Garel <alex@openfoodfacts.org>
Copy link

sonarqubecloud bot commented Mar 5, 2024

Quality Gate Passed Quality Gate passed

Issues
21 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@alexgarel alexgarel merged commit 0986951 into main Apr 2, 2024
12 checks passed
@alexgarel alexgarel deleted the dq_correct_wrong_lang__for_tags branch April 2, 2024 08:10
john-gom pushed a commit that referenced this pull request May 24, 2024
Python code to: 
 - fetch unknown tags for categories, allergens, etc.
 - check if each unknown tag exists in the taxonomy in different language
 - if yes, to all products having the tag in wrong language:
    - add existing tag in different language 
    - remove tag in wrong language 
   
Code is generalized to - not only categories, but also - additional tags fields.

- Part of #7838
---------

Co-authored-by: Alex Garel <alex@openfoodfacts.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🧽 Data quality https://wiki.openfoodfacts.org/Quality
Projects
None yet
3 participants