-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aliases now also work for nested fields; Only retrieve data required for constructing a response from the database. #1304
base: main
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1304 +/- ##
==========================================
+ Coverage 91.35% 91.51% +0.15%
==========================================
Files 74 74
Lines 4374 4454 +80
==========================================
+ Hits 3996 4076 +80
Misses 378 378
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
I still made some edits in this area during my stay in Barcelona, so I would like to add those changes to this PR as well. |
… 2.Now only the requested fields are retrieved from the backend.
…d again eventhough they had already been determined in find in entry_collections.py
…to trigger tests.
… and removed fallback value for get_non_optional_fields.
… and removed fallback value for get_non_optional_fields.
Is this ready for review now then @JPBergsma ? |
Yes, you can review it now. I have also updated the text at the start of this PR, so you may want to reread it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the hard effort on this @JPBergsma! I've focused this first review on the response fields aspect without looking too deeply at the aliases. Whilst the nested aliases would be nice, we should make sure it doesn't effect performance for server's that don't use them (perhaps just hiding all of that functionality behind a switch based on whether there are dotted aliases in the server config).
I agree that this is definitely two separate PRs, but understand if it now too hard to disentangle them.
I think selecting which fields to grab from the database could be useful going forward (though I guess in the trajectories case they might be so big as to live in a different collection/table anyway?), however I think the current implementation falls a bit short in terms of performance and changes to the existing Python API that others have built their servers on. Whilst we shouldn't feel completely married to the existing way we do things (i.e., we can introduce breaking changes where necessary before v1), we should bear in mind that this will probably be a prohibitive blocker for APIs using this package (Materials Project for example is still using v0.13.3 from 18 months ago.
for result in results: | ||
for field in include_fields: | ||
set_field_to_none_if_missing_in_dict(result["attributes"], field) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to incur a significant performance overhead, but I guess you want to do it so that you don't have to pull e.g., entire trajectories from the database each time, yet you still want to deserialize the JSON into your classes? I think I would suggest we instead have a per-collection deserialization flag, as presumably you only want to deserialize trajectories once (on database insertion) anyway. Does that make sense?
If you want to retain this approach, it might be cleaner to do it at the pydantic level, e.g., a root_validator
that explicitly sets all missing fields to None
(see also https://pydantic-docs.helpmanual.io/usage/models/#creating-models-without-validation as an option).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think this is particularly heavy compared to all the other things we do in the code.
It is what we previously did in handle_response_fields. I only moved it here, so we can do it before the deserialization and added code to handle nested fields.
For biomolecular data, a structure can easily have 10,000 atoms, so retrieving them from the database and putting them in the model would take some time. This way we can avoid that if the species_at_sites and cartesian_site_positions are not in the response_fields. (I also made a patch in the code for Barcelona that allowed them to specify the default response fields, so they can choose to not have these fields in the response by default.)
I did not want to make the change even bigger by bypassing the rest of the validator (as in your second suggestion).
But from a performance viewpoint, bypassing the validation would be good.
Do you want me to add this to this PR or to a future PR ?
I tried the root validator idea, but it seems I already get an error before the root_validator is executed, so I do not think this solution will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, fair enough, just looks a bit scarier as a double for loop alongside the recursive descent into dictionaries to get the nested aliases. It's quite hard to reason about this, so I might set up a separate repo for measuring performance in the extreme limits (1 structure of 10000 atoms vs 10000 structures of a few atoms -- i.e., what we have now, ignoring pagination of course).
I tried the root validator idea, but it seems I already get an error before the root_validator is executed, so I do not think this solution will work.
Did you use @root_validator(pre=True)
to make sure it gets run before anything else? Perhaps bypassing the validation altogether can wait for another PR, as you suggest, but I'd like to make the performance tests first at least (doesn't necessarily hold up this PR but it might hold up the next release).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed I made a mistake in my test script.
So it may be possible to do this with a root_validator after all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a root_validator to the attributes class that, if a flag is set, checks whether all required fields are present and if not adds them and sets them to 0.
I'll try to put the handling of the other include fields back to the place where it happened originally so the code changes less.
+ cls.ALIASES | ||
] | ||
+ list(cls.PROVIDER_FIELDS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff is getting very confused between these similar functions, could you add the new stuff below instead all_aliases
instead?
def deserialize( | ||
cls, results: Union[dict, Iterable[dict]] | ||
) -> Union[List[EntryResource], EntryResource]: | ||
if isinstance(results, dict): | ||
return cls.ENTRY_RESOURCE_CLASS(**cls.map_back(results)) | ||
def add_alias_and_prefix(cls, doc: dict) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue here with the diff, it is hard to see the deserialize has also been changed as well as moved. Could you put your new stuff underneath it?
def deserialize(cls, results: Iterable[dict]) -> List[EntryResource]: | ||
return [cls.ENTRY_RESOURCE_CLASS(**result) for result in results] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just generally I don't like that we are changing the functionality of existing methods (no longer doing the mapping) like deserialize
as mappers are used elsewhere beyond just the server (and potentially by other users).
def handle_response_fields( | ||
def remove_exclude_fields( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the removal of handle_response_fields
really necessary? Again, my server code uses this independently of this package. We need to be careful not to force people into a rewrite of their existing stuff for the sake of new features (that they may or may not use)
@@ -105,6 +105,7 @@ | |||
"pydantic~=1.10,>=1.10.2", | |||
"email_validator~=1.2", | |||
"requests~=2.28", | |||
"pyyaml>=5.4, <7", # Keep at pyyaml 5.4 for aiida-core support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you pin down which part of the non-server stuff needs pyyaml? I assume it was just for reading the config files in yaml format previously
Co-authored-by: Matthew Evans <7916000+ml-evs@users.noreply.github.com>
925a5b7
to
1e49fb8
Compare
0a551aa
to
596bb73
Compare
@root_validator(pre=True) | ||
def set_missing_to_none(cls, values): | ||
if "set_missing_to_none" in values and values.pop("set_missing_to_none"): | ||
for field in cls.schema()["required"]: | ||
if field not in values: | ||
if ( | ||
field == "structure_features" | ||
): # It would be nice if there would be a more universal way to handle special cases like this. | ||
values[field] = [] | ||
else: | ||
values[field] = None | ||
return values | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks on the right track! I've just played around with something too after spotting something in the pydantic docs about default factories. Would the snippet also solve this issue?
>>> from pydantic import BaseModel, Field
>>> from typing import Optional
>>> class Model(BaseModel):
... test: Optional[str] = Field(default_factory=lambda: None)
...
>>> Model()
Model(test=None)
>>> Model(test=None)
Model(test=None)
>>> Model(test="value")
Model(test='value')
>>> Model.schema()
{'title': 'Model', 'type': 'object', 'properties': {'test': {'title': 'Test', 'type': 'string'}}}
We can then patch the underlying OptimadeField
and StrictField
wrappers to default to having a default_factory
that returns null in cases where there is no default value for the field to fall back on, and we can do this without modifying the schema or the models.
The only concern is that this functionality might get removed from pydantic:
The default_factory argument is in beta, it has been added to pydantic in v1.5 on a provisional basis. It may change significantly in future releases and its signature or behaviour will not be concrete until v2.
Though it has already lasted a few versions.
I have put some of the changes I made in Barcelona and some more things in this PR.
I got a bit carried away and in hindsight I should probably have split it up in two or more PRs as it has become quite large and involves changes that could have been separated.
I am also still a bit worried that I may have changed too much and that there could be problems with code depending on the optimade python tools.
It now allows mapping to arbitrary nesting levels as long as the fields are separated by ".", which is similar to how they are handled in MongoDB.
Example:
"aliases": { "structures": {"OPTIMADE_field": "field.nested_field"} }
It also allows adding database specific fields to top level optimade dictionaries like: "species"
So if you add "species.oxidation_state" to the provider_fields, it will be presented as "species._exmpl_oxidation_state"
The mapping is now done in two steps. One step for removing/adding the prefixes and one step for mapping the optimade field to the backend specific field.
The
all_aliases
method therefore now only returns the mapping between the optimade fields and the backend fields.The
all_prefixed_fields
now contains the pairs of prefixed and unprefixed optimade fields.Another change that is in this PR is that only the requested fields are now retrieved from the Mongo database. This caused some issues with validation, because fields that are required for a normal response are now no longer present when the
response_fields
parameter is present. I there for had to make the validation less strict in a few places.