Validation criterion 3: biographical data #419
…iginal data; ensure complete comparison of dates
… & update docstrings
…gs to use sandbox item 2
…hared statements; revisit & rename variables
…fix log formatting exception
fLGTM (functionally looks good to me) - maybe try to reduce code duplication?
@click.option(
    '-s',
    '--sandbox',
    is_flag=True,
    help=f'Perform all edits on the Wikidata sandbox item {vocabulary.SANDBOX_2}.',
)
-def people_cli(catalog, statements, sandbox):
+def people_cli(catalog, statements, criterion, sandbox):
There seems to be significant duplication between people_cli/works_cli and add_works_statements/add_people_statements, as well as other functions here (_add_or_reference/_add_or_reference_works). Is it worth abstracting a little to make future changes easier?
I was thinking the same: these functions were implemented in version 1. I will open a separate PR to refactor those parts.
See #420
LOGGER.info('Dead identifiers dumped to %s', dead_ids_path)

# Dump Wikidata cache
if dump_wikidata:
This part is also repeated a few times. It may be worth making a class for the cache, so that it is also easier to switch to a new backend if needed.
Great idea, filed in #421
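For reference, the cache class suggested above could look roughly like the sketch below. It is only an illustration: the class name `WikidataCache`, its methods, and the file name are hypothetical and do not come from the soweego codebase, and pickle is assumed as the storage backend.

```python
import os
import pickle
import tempfile


class WikidataCache:
    """Hypothetical wrapper that owns loading and saving of the cache."""

    def __init__(self, path):
        self.path = path
        self.data = None  # None means "nothing cached yet"

    def load(self):
        """Read the cache from disk if the file exists, else keep None."""
        if os.path.isfile(self.path):
            with open(self.path, 'rb') as fin:
                self.data = pickle.load(fin)
        return self.data

    def save(self, data):
        """Persist a freshly gathered cache, replacing any previous one."""
        self.data = data
        with open(self.path, 'wb') as fout:
            pickle.dump(data, fout)


# Usage: the first load finds no file, the second one reads what was saved
with tempfile.TemporaryDirectory() as tmp:
    cache = WikidataCache(os.path.join(tmp, 'wd_cache.pkl'))
    first = cache.load()
    cache.save({'Q1': {'P1': 'x'}})
    reloaded = WikidataCache(cache.path).load()
```

With the serialization confined to `load` and `save`, switching to another backend would only touch those two methods.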
LOGGER.info(
    'Identifiers gathered from Wikidata dumped to %s', wd_ids_path
)
dead, wd_cache = dead_ids(catalog, entity)
I'd just set wd_cache to None and call dead_ids once. This will make possible future refactorings easier:
if os.path.isfile(wd_cache_path):
    wd_cache = ....
else:
    wd_cache = None
dead, wd_cache = dead_ids(catalog, entity)
Then update the later code to save the new cache if needed. (Also see my other comment on having an object to handle all this loading and saving of the cache.)
Suggestion added to #421
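A runnable sketch of the single-call flow proposed in this thread follows. The `dead_ids` below is a stand-in with made-up return values, its `wd_cache` keyword is an assumption, and `check` is a hypothetical caller; only the control flow (load or `None`, one call, save if newly built) reflects the suggestion.

```python
import os
import pickle
import tempfile


def dead_ids(catalog, entity, wd_cache=None):
    """Stand-in for the real function: returns (dead, wd_cache)."""
    if wd_cache is None:
        # Pretend to gather identifiers from Wikidata
        wd_cache = {'Q42': {'id-1', 'id-2'}}
    dead = {'id-2': {'Q42'}}  # made-up result, for illustration only
    return dead, wd_cache


def check(catalog, entity, wd_cache_path):
    """Hypothetical caller with a single dead_ids call site."""
    # Load the cache when available, otherwise fall back to None
    loaded = os.path.isfile(wd_cache_path)
    if loaded:
        with open(wd_cache_path, 'rb') as fin:
            wd_cache = pickle.load(fin)
    else:
        wd_cache = None

    # One call, whatever the cache state
    dead, wd_cache = dead_ids(catalog, entity, wd_cache=wd_cache)

    # Save the cache only when it was just built
    if not loaded:
        with open(wd_cache_path, 'wb') as fout:
            pickle.dump(wd_cache, fout)
    return dead, loaded


# Usage: the first run builds and saves the cache, the second one reuses it
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'wd_cache.pkl')
    _, first_run_loaded = check('some_catalog', 'some_entity', path)
    dead, second_run_loaded = check('some_catalog', 'some_entity', path)
```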
@@ -410,21 +540,27 @@ def links(
 of URL domains. Default: ``False``
 :param wd_cache: (optional) a ``dict`` of links gathered from Wikidata
 in a previous run. Default: ``None``
-:return: 4 objects
+:return: ``tuple`` of 6 objects
A tuple of 6 objects is a bit cumbersome to handle; consider using a NamedTuple or a dataclass.
Totally agree, filed in #422
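As an illustration of the NamedTuple option, a minimal sketch is given below. The class `LinksResult`, all six field names, and `links_stub` are invented for this example and are not taken from the codebase; only the idea of naming the six returned objects comes from the review.

```python
from typing import NamedTuple, Optional


class LinksResult(NamedTuple):
    """Hypothetical return type; the field names are illustrative only."""

    to_deprecate: dict
    to_add_ids: list
    to_add_urls: list
    to_reference_ids: list
    to_reference_urls: list
    wd_cache: Optional[dict]


def links_stub() -> LinksResult:
    """Stand-in for a function that used to return a bare 6-tuple."""
    return LinksResult(
        to_deprecate={},
        to_add_ids=[('Q1', 'P1', 'some-id')],
        to_add_urls=[],
        to_reference_ids=[],
        to_reference_urls=[],
        wd_cache=None,
    )


result = links_stub()
ids_to_add = result.to_add_ids  # access by name instead of by position

# Existing callers that unpack positionally keep working
deprecate, add_ids, add_urls, ref_ids, ref_urls, cache = result
```

A NamedTuple stays compatible with positional unpacking, so callers can migrate gradually; a dataclass would require updating every call site at once.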
LOGGER.info("No %s to be added, won't dump to file", log_msg_subject)


-def _load_wd_cache(file_handle):
any reason why you removed this function altogether?
Because we switched to pickle: pickle.load replaces the old function.
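To show why no custom loader is needed after that switch, here is a small round trip with the standard `pickle` module. The shape of the `wd_cache` dictionary is an assumption made for the example, not the actual cache layout.

```python
import io
import pickle

# Assumed cache shape: QID -> sets of values (illustrative only)
wd_cache = {
    'Q42': {
        'identifiers': {'id-1', 'id-2'},
        'links': {'https://example.org/a'},
    }
}

# Dump to an in-memory buffer instead of a real file
buffer = io.BytesIO()
pickle.dump(wd_cache, buffer)

# pickle.load rebuilds the original objects, sets included
buffer.seek(0)
restored = pickle.load(buffer)
```

Note that `pickle.load` can execute arbitrary code while deserializing, so it should only be used on cache files that the tool itself produced.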
Thanks a lot for the review, really valuable.
This PR introduces a set of features for biographical (bio) data validation, corresponding to criterion 3, see https://soweego.readthedocs.io/en/latest/validator.html#soweego.validator.checks.bio .
More specifically: