
Distilled ZairaChem models in ONNX format #32

Open
2 tasks
miquelduranfrigola opened this issue Dec 5, 2023 · 8 comments

@miquelduranfrigola (Member) commented Dec 5, 2023

Motivation

ZairaChem models are large and will always be large, since ZairaChem uses an ensemble-based approach. Nonetheless, we would like to offer the option to distill ZairaChem models for easier deployment, especially for online inference. We'd like to do this in an interoperable format such as ONNX.

The Olinda package

Our colleague @leoank has already contributed a fantastic package named Olinda that we could, in principle, use for this purpose. Olinda takes an arbitrary model (in this case, a ZairaChem model) and produces a much simpler model, stored in ONNX format. Olinda uses a reference library for the teacher/student training and is nicely coupled with other tools that @leoank developed, such as ChemXOR for privacy-preserving AI and the Ersilia Compound Embedding, which provides dense 1024-dimensional embeddings.
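
To make the teacher/student scheme concrete, here is a minimal, self-contained sketch of the general idea. This is not Olinda's actual API; the model choices, synthetic data, and file name are purely illustrative (the 1024-dimensional inputs mirror the Ersilia Compound Embedding):

```python
# Illustrative teacher/student distillation ending in an ONNX export.
# Requires scikit-learn and skl2onnx; all data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor
from skl2onnx import to_onnx

rng = np.random.default_rng(0)

# Stand-in for a heavy ensemble "teacher" (e.g., a ZairaChem model),
# trained on labelled 1024-d embeddings.
X_train = rng.normal(size=(500, 1024)).astype(np.float32)
y_train = (X_train[:, 0] > 0).astype(int)
teacher = RandomForestClassifier(n_estimators=200, random_state=0)
teacher.fit(X_train, y_train)

# Reference library: unlabelled molecules featurized the same way.
# The teacher provides soft labels for the student to imitate.
X_ref = rng.normal(size=(2000, 1024)).astype(np.float32)
soft_labels = teacher.predict_proba(X_ref)[:, 1]

# Student: a much smaller model fit to the teacher's outputs.
student = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
student.fit(X_ref, soft_labels)

# Export the student to ONNX for lightweight, interoperable deployment.
onnx_model = to_onnx(student, X_ref[:1])
with open("student.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```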

Roadmap

  • Proof of principle: We first need to show that model distillation can be done based on ZairaChem's checkpoints.
  • Distillation module in ZairaChem: Then, we need to add a distillation module to ZairaChem, run at the end of model training, so that distillation becomes part of the training pipeline.
@GemmaTuron (Member)

We will start by testing Olinda again, @JHlozek (see: ersilia-os/olinda#3)

@JHlozek (Collaborator) commented Jun 12, 2024

I've been working on this and currently have Olinda installed in the ZairaChem environment (which required some of the usual dependency-conflict resolution).
I have a version of the code that can invoke the ZairaChem pipeline and collect its output to complete the distillation, so in principle this works.

There are still many improvements to work on next, including:

  • suppression of ZairaChem output
  • use of pre-calculated descriptors
  • merging the model training set with ChEMBL data for the Olinda training set

@miquelduranfrigola (Member, Author)

Thanks @JHlozek, this is great.

@JHlozek (Collaborator) commented Jul 4, 2024

Olinda updates:
The ZairaChem distillation process now runs successfully, with pre-calculated descriptors too.
This covers the points above: suppressing the extensive output produced by ZairaChem and merging the model training set with the pre-calculated reference descriptors.

As a test, I trained a model on H3D data up to June 2023, including 1k pre-calculated reference descriptors, and then predicted prospective compounds from the following 6 months. Below are a scatter plot comparing the distilled and ZairaChem model predictions and a ROC curve for the distilled model on the prospective data.

[Image: scatter plot of distilled vs. ZairaChem model predictions]

[Image: ROC curve for the distilled model on prospective data]
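
For reference, this kind of comparison can be reproduced in a few lines with scikit-learn and matplotlib. The following is a minimal sketch using synthetic placeholder scores and labels, not the actual H3D predictions:

```python
# Sketch of the evaluation above with synthetic stand-in data; in practice
# the arrays would be the ZairaChem and distilled-model scores on the
# prospective compounds, plus their measured binary labels.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)  # placeholder prospective labels
zaira_scores = np.clip(y_true * 0.5 + rng.normal(0.3, 0.2, 300), 0, 1)
distilled_scores = np.clip(zaira_scores + rng.normal(0, 0.1, 300), 0, 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter: distilled vs. ZairaChem scores (diagonal = perfect agreement)
ax1.scatter(zaira_scores, distilled_scores, s=8, alpha=0.5)
ax1.plot([0, 1], [0, 1], "k--")
ax1.set(xlabel="ZairaChem score", ylabel="Distilled score")

# ROC curve for the distilled model on the prospective set
fpr, tpr, _ = roc_curve(y_true, distilled_scores)
ax2.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
ax2.plot([0, 1], [0, 1], "k--")
ax2.set(xlabel="False positive rate", ylabel="True positive rate")
ax2.legend()
plt.tight_layout()
plt.show()
```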

To facilitate testing, I have written code that prepares a folder of pre-calculated descriptors for 1k molecules, which can be run in the first cell of the demo notebook.
For testing, perform the following steps:

  • Install ZairaChem with the following dependency changes:
    install_linux.txt (change to a .sh file)
    requirements.txt
  • Install Olinda into the ZairaChem conda environment with:
    python3 -m pip install git+https://github.com/JHlozek/olinda.git
    There seems to be an issue in Docker with mixed git+https links, which I fixed by downgrading to requests==2.29.0
  • Install jupyter:
    conda install -c conda-forge jupyterlab
  • Open olinda/notebooks/demo_quick.ipynb and run the first cell to produce the pre-calculated descriptors for 1k molecules
  • Update the paths in the Distillation section for the ZairaChem model to be distilled and the save path for the ONNX model
  • Run all the Distillation cells (a sketch for sanity-checking the resulting ONNX model follows below)
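
Once the Distillation cells finish, the exported model can be sanity-checked directly with onnxruntime. A minimal sketch, assuming the distilled model takes a 1024-d float input (per the Ersilia Compound Embedding); the file path is a placeholder:

```python
# Load the distilled model and run a dummy prediction to verify the export.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("distilled_model.onnx")  # placeholder path
inp = sess.get_inputs()[0]
print("input:", inp.name, inp.shape)  # inspect the expected input signature

x = np.random.rand(1, 1024).astype(np.float32)  # dummy embedding vector
outputs = sess.run(None, {inp.name: x})
print("prediction:", outputs[0])
```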

I suggest testing this and then closing #3 to keep the conversation centralized here.
I'll post next steps following this.

@JHlozek (Collaborator) commented Jul 4, 2024

Next steps:

  • Performance testing for different-sized pre-calculated sets (1k, 10k, 100k), both with and without the ZairaChem training set
  • Finish calculating the 100k descriptor sets (just eos4u6p to go)
  • The base pipeline's output is a heavily nested list structure, which could be tweaked to improve usability
  • Investigate speed improvements. The following currently still need to be calculated at runtime:
    - mellody-tuner
    - treated descriptors (only minor)
    - manifolds
    - tabpfn
    - eos59rr
    - eos6m4j

@miquelduranfrigola (Member, Author)

This is very interesting and promising, @JHlozek!
There seems to be a tendency towards false negatives (the upper-left triangle in your plot). This is interesting and can hopefully be ameliorated with (a) more data and/or (b) including the training set.
Great progress here! Exciting.

@GemmaTuron (Member)

Summary of the weekly meeting: the distilled models look good, but there seems to be a bit of underfitting as we add external data, so we need to make the ONNX model a bit more complex.
In addition, we will look for data to validate the generalizability of the model, from H3D data (@JHlozek) and ChEMBL (@GemmaTuron).

@GemmaTuron (Member)

Hi @JHlozek

I have a dataset that contains IC50 data for P. falciparum: over 17K molecules, with Active (1) and Inactive (0) defined at two cut-offs (hc = high cut-off, 2.5 uM; lc = low cut-off, 10 uM). They are curated from ChEMBL, all public data.
I do not have the strain (it is a pool), but we can assume most of it comes from sensitive strains, likely NF54.
Let me know if these are useful!

pfalciparum_IC50_hc.csv

pfalciparum_IC50_lc.csv
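
For clarity, this is roughly how the two cut-offs translate IC50 values into the binary labels in these files. A minimal pandas sketch, assuming a raw export with smiles and ic50_uM columns; the column names and the "active at or below cut-off" direction are assumptions, and the attached CSVs already contain the final labels:

```python
# Binarize IC50 values at the two cut-offs described above.
import pandas as pd

df = pd.read_csv("pfalciparum_ic50_raw.csv")  # hypothetical raw export

# Assumed convention: a molecule is Active (1) if its IC50 is at or
# below the cut-off, Inactive (0) otherwise.
df["active_hc"] = (df["ic50_uM"] <= 2.5).astype(int)   # high cut-off: 2.5 uM
df["active_lc"] = (df["ic50_uM"] <= 10.0).astype(int)  # low cut-off: 10 uM

df[["smiles", "active_hc"]].to_csv("pfalciparum_IC50_hc.csv", index=False)
df[["smiles", "active_lc"]].to_csv("pfalciparum_IC50_lc.csv", index=False)
```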
