Align datasets to models #46

Open
brandomr opened this issue Sep 11, 2023 · 2 comments

Comments

@brandomr
Contributor

Challenge

How can we automatically align models to datasets? Specifically, how can we most effectively align elements of a model to features within datasets?

Currently, models and datasets are profiled separately by MIT and SKEMA. Both datasets and models end up with (optional) groundings which, for each feature of the data or model, tie it to an element in the TA2 Domain Knowledge Graph (DKG). As far as I know, DKG code lives here.

For example, a model may have a compartment called infected, which is grounded to:

{"url":"https://bioregistry.io/vsmo:0000268","score":0.78,"prefix":"vsmo","identifier":"0000268","curie":"vsmo:0000268","name":"infected","status":"name"}

Let's say there is a dataset with the feature infections, which is grounded to:

[{"url":"https://bioregistry.io/apollosv:00000114","score":0.78,"prefix":"apollosv","identifier":"00000114","curie":"apollosv:00000114","name":"infection","status":"name"},{"url":"https://bioregistry.io/ido:0000586","score":0.78,"prefix":"ido","identifier":"0000586","curie":"ido:0000586","name":"infection","status":"name"},{"url":"https://bioregistry.io/ncit:C128320","score":0.76,"prefix":"ncit","identifier":"C128320","curie":"ncit:C128320","name":"Infection","status":"name"}]

There is no intersection between these groundings, but clearly there is a relationship between the infected compartment in the model and the infections feature in the dataset. This makes it potentially challenging to identify relevant data for model calibration/simulation, since calibration requires matching data to specific model compartments/elements.
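The mismatch above can be reproduced in a few lines. This is a minimal sketch, not DKG functionality: the groundings are the ones quoted in this issue, trimmed to their curie and name fields, and the name-similarity fallback using Python's standard difflib is only an assumption about one cheap signal that could bridge the gap.

```python
from difflib import SequenceMatcher

# Groundings copied from the issue text, reduced to the two fields used here.
model_grounding = {"curie": "vsmo:0000268", "name": "infected"}
dataset_groundings = [
    {"curie": "apollosv:00000114", "name": "infection"},
    {"curie": "ido:0000586", "name": "infection"},
    {"curie": "ncit:C128320", "name": "Infection"},
]


def shared_curies(model_g, data_gs):
    """Return the CURIEs that appear on both sides (exact-match alignment)."""
    return {model_g["curie"]} & {g["curie"] for g in data_gs}


def best_name_similarity(model_g, data_gs):
    """Return the highest case-insensitive string similarity between names."""
    return max(
        SequenceMatcher(None, model_g["name"].lower(), g["name"].lower()).ratio()
        for g in data_gs
    )


overlap = shared_curies(model_grounding, dataset_groundings)
similarity = best_name_similarity(model_grounding, dataset_groundings)

print(overlap)               # empty set: exact CURIE matching finds nothing
print(round(similarity, 2))  # roughly 0.7: the names are clearly related
```

Exact CURIE intersection returns nothing here, while even a crude string comparison on the grounding names surfaces the relationship. An embedding-based comparison would presumably do better on pairs that are related but not spelled alike.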

Potential Solutions

  • Embed the groundings for both models and datasets and enable users to perform semantic search over both. This would include embedding dataset and model descriptions. When a user is searching for data relevant to their model, they would use free text search, which would be powered by a semantic backend to surface the most useful data.
  • Create an /align_data_to_model endpoint which, for a given model_id, attempts to find relevant data features on a model-element-to-data-feature basis. For example, an SIR model's susceptible, infected, and recovered compartments would be automatically matched and ranked to features (potentially from multiple datasets) based on groundings or whatever other information we can efficiently use.

The first approach will fit best inside TDS and is something we may want to do anyway. Vector/semantic search over content besides papers seems quite useful. We could even support semantic code search which would be potentially very useful.
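As a rough illustration of the first approach, here is a minimal semantic-search sketch. Everything in it is hypothetical: the document IDs and descriptions are made up, and the character-trigram `embed` function is only a stand-in for a real embedding model so the example runs with the standard library alone. The `embed_fn` parameter is where a real model would be plugged in.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy embedding: counts of character n-grams (stand-in for a real model)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query, documents, top_k=3, embed_fn=embed):
    """Rank documents (id -> text) by similarity to a free-text query."""
    q = embed_fn(query)
    scored = [(cosine(q, embed_fn(text)), doc_id) for doc_id, text in documents.items()]
    return sorted(scored, reverse=True)[:top_k]


# Hypothetical index mixing dataset features and model elements.
documents = {
    "dataset:cases/infections": "infections: daily count of new infection cases",
    "dataset:cases/deaths": "deaths: daily count of reported deaths",
    "model:sir/recovered": "recovered: compartment of recovered individuals",
}

results = search("infected population", documents)
for score, doc_id in results:
    print(f"{score:.2f}  {doc_id}")
```

In a real deployment the vectors would be computed once and stored in an index rather than recomputed per query. This sketch recomputes them only to stay short.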

The second approach will fit best inside this repository since it mirrors some of the existing endpoints (e.g. aligning a model to its paper).
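The core of such an endpoint might look like the sketch below. This is a sketch under stated assumptions, not an implementation of any existing endpoint: the input shapes (`elements`, `features`, `groundings`) are hypothetical simplifications of AMR and data card structures, the scoring rule (a shared CURIE wins, otherwise fall back to name similarity) is just one possible choice, and only the infected/infections groundings come from this issue. The other features are left ungrounded rather than given invented CURIEs.

```python
from difflib import SequenceMatcher


def score(element, feature):
    """Score one model element against one data feature in [0, 1]."""
    e_curies = {g["curie"] for g in element["groundings"]}
    f_curies = {g["curie"] for g in feature["groundings"]}
    if e_curies & f_curies:
        return 1.0  # a shared grounding is treated as a perfect match
    # Fallback: best string similarity over the names and grounding names.
    e_names = [element["name"]] + [g["name"] for g in element["groundings"]]
    f_names = [feature["name"]] + [g["name"] for g in feature["groundings"]]
    return max(
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a in e_names
        for b in f_names
    )


def align_data_to_model(model, datasets, top_k=2):
    """For each model element, rank features drawn from all datasets."""
    features = [(ds["id"], f) for ds in datasets for f in ds["features"]]
    alignment = {}
    for element in model["elements"]:
        ranked = sorted(
            ((score(element, f), ds_id, f["name"]) for ds_id, f in features),
            reverse=True,
        )
        alignment[element["name"]] = ranked[:top_k]
    return alignment


# Hypothetical SIR model and datasets in a simplified shape.
model = {"elements": [
    {"name": "susceptible", "groundings": []},
    {"name": "infected", "groundings": [{"curie": "vsmo:0000268", "name": "infected"}]},
    {"name": "recovered", "groundings": []},
]}
datasets = [
    {"id": "dataset-a", "features": [
        {"name": "infections", "groundings": [
            {"curie": "apollosv:00000114", "name": "infection"},
            {"curie": "ido:0000586", "name": "infection"},
            {"curie": "ncit:C128320", "name": "Infection"},
        ]},
        {"name": "population", "groundings": []},
    ]},
    {"id": "dataset-b", "features": [{"name": "recoveries", "groundings": []}]},
]

alignment = align_data_to_model(model, datasets)
for element_name, matches in alignment.items():
    print(element_name, matches)
```

The `score` function is the piece that would be swapped for embedding similarity once embeddings over models and datasets are available. The ranking loop around it would stay the same.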

Considerations

It is likely that we will need multiple examples of models and datasets for testing and development. Here is an example model. These are often referred to as AMRs (ASKEM Model Representations).

Here is an example data card, but note that it is not in the canonical dataset format for TDS. We can generate/pull some in the appropriate format, but for now this at least gives a sense of roughly how DKG groundings appear for data.

@ryanholtschneider2

During the TA1 working group there were comments on:

  • It would be nice to be able to try different embedding models. This is fairly easy, but I can make it even easier.
  • The need for benchmarks - we could think of this as a really fine difficult

Some other conversation points:

  • Is there even a good grounding and, if so, how much does it help? And if so, can we get to the grounding?
  • HMI workflow to grounding usage linking to help the grounding team.
  • More complicated model testing.

@brandomr
Contributor Author

brandomr commented Oct 2, 2023

Implementing this as an endpoint requires generating embeddings over models and datasets, which will first be addressed by this TDS issue, so this is currently blocked.

Projects
Status: Todo

2 participants