feat: dq_correct_wrong_lang__for_tags #9581
Conversation
SonarCloud Quality Gate passed, but some issues were introduced (16 new issues).
Codecov report: all modified and coverable lines are covered by tests ✅

Additional details and impacted files:

    @@ Coverage Diff @@
    ##             main    #9581      +/-   ##
    ==========================================
    + Coverage   49.10%   49.24%   +0.14%
    ==========================================
      Files          66       66
      Lines       20451    20532      +81
      Branches     4910     4944      +34
    ==========================================
    + Hits        10042    10112      +70
      Misses       9132     9132
    - Partials     1277     1288      +11

☔ View full report in Codecov by Sentry.
@benbenben2 that's a really great job!
I have no blocker, so I approve.
But one thing I would advise is to make the script able to recover in case of incident. That is:
- after the search phase, store results in JSON
- optionally skip the search phase if JSON results already exist and jump straight to the update phase
- use the already-updated product information you log to skip updates that are already done.
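The third point above could be sketched as follows. This is a minimal sketch, not the script's actual code: the log format assumed here ("code,plural;current_tag;updated_tag") matches the one discussed later in this thread, and the file name is hypothetical.

```python
import os

def load_processed_codes(logs_file: str) -> set:
    """Return the set of product codes already updated, read from the progress log."""
    processed = set()
    if os.path.isfile(logs_file):
        with open(logs_file) as f:
            for line in f:
                # each line starts with the product barcode, before the first comma
                code = line.split(",", 1)[0].strip()
                if code:
                    processed.add(code)
    return processed

# Hypothetical usage inside the update loop:
# processed = load_processed_codes("update_tags_per_languages_logs")
# for product in all_products_for_tag["products"]:
#     if product["code"] in processed:
#         continue  # already done in a previous, interrupted run
```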
scripts/update_tags_per_languages.py
Outdated
    dev = True
    if dev:
        env = "net"
        user = "off:off@"
    else:
        env = "org"
        user = ""

    headers = {
        'Accept': 'application/json',
        'User-Agent': 'UpdateTagsLanguages',
    }
    data = {
        'user_id': '',
        'password': '',
    }
We could use argparse for this
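A minimal argparse sketch of this suggestion, replacing the hard-coded `dev` flag; the option names and defaults are assumptions, not the script's final CLI.

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Update tags per language")
    parser.add_argument("--env", choices=["net", "org"], default="net",
                        help="openfoodfacts.net (staging) or .org (production)")
    parser.add_argument("--user", default="off:off@",
                        help="basic-auth prefix for staging URLs; use '' for production")
    return parser.parse_args(argv)

# Example: explicit production-style invocation
args = parse_args(["--env", "org", "--user", ""])
```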
scripts/update_tags_per_languages.py
Outdated
    tags_list_url = f"https://{user}world.openfoodfacts.{env}/{plural}.json"

    # by default the query returns 24 results.
    # increase to 1000 (so far chips in EN (should be crisps in EN)
    # had max number of products for categories with ~550)
    products_list_for_tag_url = f"https://{user}world.openfoodfacts.{env}/{singular}/{{tag_id_placeholder}}.json?page_size=1000"

    file = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', f'taxonomies/{plural}.txt'))

    # country is needed otherwise <plural>_lc will be "en"
    post_call_url = f"https://{user}{{country}}.openfoodfacts.{env}/cgi/product_jqm2.pl"
(not to change) Just in case you care: typically this kind of globals would fit better as attributes of a class, where you then call a method to process one tag.

    class ProcessTag:
        def __init__(self, plural, singular):
            self.tags_list_url = …
            ...

        def process(self):
            all_tags = self.get_from_api()
            # etc.
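A fleshed-out version of the sketch above, for illustration; the method bodies and the extra constructor parameters are assumptions, not the script's actual implementation.

```python
class ProcessTag:
    """Groups the per-tag-type globals (URLs, file paths) as instance attributes."""

    def __init__(self, plural: str, singular: str, user: str = "", env: str = "org"):
        self.plural = plural
        self.singular = singular
        self.tags_list_url = f"https://{user}world.openfoodfacts.{env}/{plural}.json"

    def process(self):
        all_tags = self.get_from_api()
        # ...search and update phases would follow here...
        return all_tags

    def get_from_api(self):
        # Placeholder: a real implementation would fetch self.tags_list_url.
        raise NotImplementedError
```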
scripts/update_tags_per_languages.py
Outdated
    # should retrieve all "en:blablabla, tag_name" or "it:tag_name"
    tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]')

    with open(file, "r") as f:
I would have used the taxonomy JSON (like https://static.openfoodfacts.org/data/taxonomies/categories.json) and loaded it in memory. This way you don't have to parse anything; you just iterate over a map.
But I would do it once and for all, outside the for loop.
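A sketch of this suggestion: load the dump once, before the loop, then do dict lookups instead of regex scans over a text file. The shape of the entries is an assumption, and (as noted in the reply below) the public JSON dump may not expose synonyms.

```python
import json

def load_taxonomy(path: str) -> dict:
    """Load a taxonomy JSON dump (tag id -> entry) into a dict, once."""
    with open(path) as f:
        return json.load(f)

# Hypothetical usage, assuming the dump was downloaded once beforehand:
# taxonomy = load_taxonomy("categories.json")
# "en:crisps" in taxonomy   # O(1) membership test instead of re-parsing a file
```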
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not show the synonyms in the json.
I would keep the taxonomy file locally instead.
scripts/update_tags_per_languages.py
Outdated
    # should retrieve all "en:blablabla, tag_name" or "it:tag_name"
    tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]')
I'm not sure why we include a comma in it. Why don't we stop on commas?
Is this to ensure tag_name appears in a synonym list?
In f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]' there are 2 commas in the regex.
If you have:
en:Hamburger buns, Burger buns, Rolls for hamburger, Buns for hamburger
and you are searching for "Rolls for hamburger":
the first comma in the regex would ignore the comma at the end of "en:Hamburger buns, Burger buns,";
the second comma in the regex would ignore the comma right after the occurrence: ", Buns for hamburger".
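The behaviour described above can be demonstrated directly. The pattern below is a close variant of the quoted one (the synonym separator is written `,\s` here, since the escaping in the pasted snippet appears mangled); the taxonomy line and tag come from the example in this comment.

```python
import re

taxonomy_line = "\nen:Hamburger buns, Burger buns, Rolls for hamburger, Buns for hamburger\n"
tag_name = "Rolls for hamburger"

# The comma inside the repeated group skips over ", " separators between
# synonyms; the trailing [,|\n] stops the match at the comma (or newline)
# right after the occurrence of tag_name.
tag_regex = re.compile(r"\n([a-z][a-z]:(?:[\w\s\-']*,\s)*" + tag_name + r")[,|\n]")

match = tag_regex.search(taxonomy_line)
```

The capture group stops at "Rolls for hamburger": the comma before ", Buns for hamburger" is consumed by the final character class, not captured.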
Ok, I see. I'll try to comment it, feel free to change:

    # should retrieve all "en:blablabla, tag_name" or "it:tag_name"
    # the prefix is either the language or a comma.
    # Suffix is either an end of line or a comma.
    tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})[,|\n]')

Did you know you can also use positive lookahead or lookbehind ((?=...) and (?<=...))?
Or better, just capture the part you are interested in with a named group:

    tag_regex = re.compile(f'\n([a-z][a-z]:(?:[\w\s\-\']*\,-)*(?P<tag_name>{tag_name}))[,|\n]')

Then you can use match.group("tag_name") to get your tag_name.
Interesting!
The current regex captures the whole block: "([a-z][a-z]:(?:[\w\s\-\']*\,-)*{tag_name})".
It can be "en:blablabla, tag_name" or "it:tag_name", for example.
In the next lines, the script checks whether it starts with "en:" or another language; if not, it keeps the first element of the list of synonyms (for example, if we looked for "Alga spirulina" and it found "it:Spirulina, Alga spirulina", then it will keep "it:Spirulina").
Eventually, we could use "(?P<…>[a-z][a-z]:(?:[\w\s\-\']*\,-)*)(?P<tag_name>{tag_name})", but maybe it is good as it is already now.
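The named-group variant can be exercised on the Spirulina example from this comment. Again the separator is written `,\s` here as an assumption about the intended escaping, and the group name "full" is hypothetical.

```python
import re

line = "\nit:Spirulina, Alga spirulina\n"
tag_name = "Alga spirulina"

# "full" captures the whole language-prefixed block, "tag_name" just the
# occurrence we searched for.
pattern = re.compile(
    r"\n(?P<full>[a-z][a-z]:(?:[\w\s\-']*,\s)*(?P<tag_name>" + tag_name + r"))[,|\n]"
)
m = pattern.search(line)
# what the script keeps: the first element of the synonym list
first_synonym = m.group("full").split(",")[0]
```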
scripts/update_tags_per_languages.py
Outdated
    if dev:
        i = 0
    for product in all_products_for_tag["products"]:
        if dev:
            if i > 0:
                break
            i += 1
It's ok to always count, and it's easier to read :-)
    for i, product in enumerate(all_products_for_tag["products"]):
        if dev and i > 0:
            break
scripts/update_tags_per_languages.py
Outdated
    # only save possible new tags to be reviewed and added
    with open("update_tags_per_languages_possible_new_tags_" + tag_type, "w") as f:
        f.write("possible_new_tags")
    with open("update_tags_per_languages_possible_new_tags_" + tag_type, "a") as f:
        for possible_new_tag in possible_new_tags:
            f.write("\n" + possible_new_tag)
The search can be long, so I would save all the data you found to a JSON file (one per map).
This way you could add an option to only load those JSON files and process them (if they exist), avoiding rerunning the search phase in case the update phase crashes.
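A sketch of this checkpoint idea. The file name and the shape of the results dict are assumptions; the point is just dump-then-load-if-present.

```python
import json
import os

def save_results(results: dict, path: str) -> None:
    """Checkpoint the search-phase output to disk."""
    with open(path, "w") as f:
        json.dump(results, f)

def load_results(path: str):
    """Return previously saved results, or None if no checkpoint exists."""
    if os.path.isfile(path):
        with open(path) as f:
            return json.load(f)
    return None

# Hypothetical wiring around the slow search phase:
# results = load_results("search_results_categories.json")
# if results is None:
#     results = run_search_phase()  # the expensive part
#     save_results(results, "search_results_categories.json")
```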
scripts/update_tags_per_languages.py
Outdated
    with open(logs_file, "a") as f:
        f.write(f"{product['code']},{plural};{current_tag};{updated_tag}\n")
nit: you can also keep the file open and use f.flush() to ensure your writes are immediately written to disk.
I find it great that you write what is processed. You could check this file to avoid re-processing things already done in case the script failed at some point.
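The flush pattern could look like this; the helper name and the surrounding loop are hypothetical, but the log line format matches the snippet above.

```python
import os

def append_log_line(f, code: str, plural: str, current_tag: str, updated_tag: str) -> None:
    """Write one progress line and push it to disk immediately."""
    f.write(f"{code},{plural};{current_tag};{updated_tag}\n")
    f.flush()              # flush Python's buffer to the OS
    os.fsync(f.fileno())   # stricter: ask the OS to commit it to disk

# with open(logs_file, "a") as f:   # opened once, outside the product loop
#     for product in products:
#         append_log_line(f, product["code"], plural, current_tag, updated_tag)
```

Keeping the file open avoids reopening it for every product, while flush (plus fsync if you want durability across an OS crash) keeps the log usable as a resume point.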
        ],
    }
    file = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', f'taxonomies/categories.txt'))
    possible_wrong_language_tags = unknown_tags_taxonomy_comparison(all_tags_dict, file, "categories")
great to have tested this function
scripts/update_tags_per_languages.py
Outdated
    logs_file = "update_tags_per_languages_logs"
    if os.path.isfile(logs_file):
        os.remove(logs_file)
maybe not if you want to continue an old failed script.
I applied your suggestions; the script now:
The code was run for a single tag in the .net environment. The file update_tags_per_languages_wrong_languages_detected_categories contained only:
The script was interrupted before the end of the update and run again. It resumed the updates. Note that it did not add a new tag, because the tag was already listed in English:
Great job @benbenben2 !
scripts/update_tags_per_languages.py
Outdated
    for possible_new_tag in possible_new_tags:
        output_possible_new_tag_file.write("\n" + possible_new_tag)
nit: More pythonic:
output_possible_new_tag_file.write("\n".join(possible_new_tags))
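One detail worth noting about the one-liner: the original loop writes a newline before every tag, so to produce byte-identical output the join needs a leading "\n" as well. A small demonstration with sample data (the tag values are made up):

```python
possible_new_tags = ["en:crisps", "it:spirulina"]  # sample data

# the original loop: a newline before each tag
looped = ""
for tag in possible_new_tags:
    looped += "\n" + tag

# the pythonic version, with the extra leading newline to stay identical
joined = "\n" + "\n".join(possible_new_tags)
```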
Co-authored-by: Alex Garel <alex@openfoodfacts.org>
Python code to:
- fetch unknown tags for categories, allergens, etc.
- check if each unknown tag exists in the taxonomy in a different language
- if yes, for all products having the tag in the wrong language:
  - add the existing tag in the different language
  - remove the tag in the wrong language

Code is generalized to handle not only categories, but also additional tags fields.

Part of #7838

Co-authored-by: Alex Garel <alex@openfoodfacts.org>
What
Python code to:
- fetch unknown tags for categories, allergens, etc.
- check if each unknown tag exists in the taxonomy in a different language
- if yes, for all products having the tag in the wrong language:
  - add the existing tag in the different language
  - remove the tag in the wrong language

Code is generalized to handle not only categories, but also additional tags fields.
Lots of comments in the code, which should be helpful for the reviewer and user.
Instructions on how to run the code, test it, and run it in dev vs prod are provided at the top of the file.
The code was run successfully (in the .net environment) for 4 different tags:
Corresponding to the following products that have been updated:
Related issue(s) and discussion