Distilled ZairaChem models in ONNX format #32
Comments
We will start by testing Olinda again, @JHlozek (see: ersilia-os/olinda#3).
I've been working on this and currently I have Olinda installed in the ZairaChem environment (requiring some dependency conflict resolution as usual). There are still many improvements to work on next, including:
Thanks @JHlozek, this is great.
Olinda updates: As a test, I trained a model on H3D data up to June 2023, including 1k pre-calculated reference descriptors, and then predicted prospective compounds from the following 6 months. Here are a scatter plot comparing the distilled and ZairaChem model predictions and a ROC curve for the distilled model on the prospective data. To facilitate testing, I have written code that prepares a folder of pre-calculated descriptors for 1k molecules; it can be run from the first cell of the demo notebook.
I suggest testing this and then closing #3 to keep the conversation centralized here.
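For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of the two checks described above: agreement between distilled and ZairaChem predictions (here a Pearson correlation, standing in for the scatter plot) and a ROC AUC for the distilled model. It uses only the Python standard library. The function names and the example numbers are assumptions made up for illustration; they are not the H3D results and not Olinda's or ZairaChem's API.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up illustrative values, NOT results from this thread.
labels = [1, 0, 1, 0, 0, 1]                   # prospective outcomes
teacher = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6]      # hypothetical ZairaChem scores
student = [0.8, 0.3, 0.6, 0.5, 0.2, 0.4]      # hypothetical distilled scores

print("teacher/student correlation:", pearson(teacher, student))
print("distilled ROC AUC:", roc_auc(labels, student))
```

The pairwise AUC is quadratic in the number of compounds, which is fine for a quick check; for large prospective sets a rank-based implementation from an established library would be the better choice.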
Next steps:
This is very interesting and promising, @JHlozek!
Summary of the weekly meeting: the distilled models look good, but there seems to be some underfitting as we add external data, so we need to make the ONNX model a bit more complex.
Hi @JHlozek, I have a dataset of IC50 data for P. falciparum: over 17K molecules labeled Active (1) or Inactive (0) at two cut-offs (hc = high cut-off, 2.5 µM; lc = low cut-off, 10 µM). The data are curated from ChEMBL and are all public.
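As a sketch of how the two label sets relate, the snippet below binarizes an IC50 value at both cut-offs. It assumes that a molecule is Active when its IC50 is at or below the cut-off (lower IC50 meaning more potent), which is the usual convention but is not stated explicitly in the comment above. The function names and the `hc`/`lc` dictionary keys are hypothetical, not the dataset's actual column names.

```python
HIGH_CUTOFF_UM = 2.5   # "hc" cut-off from the comment above, in µM
LOW_CUTOFF_UM = 10.0   # "lc" cut-off from the comment above, in µM


def label_activity(ic50_um, cutoff_um):
    """Return 1 (Active) or 0 (Inactive) for one IC50 value in µM.

    Assumption: Active means IC50 <= cut-off.
    """
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return 1 if ic50_um <= cutoff_um else 0


def label_both(ic50_um):
    """Label one IC50 value at both cut-offs."""
    return {
        "hc": label_activity(ic50_um, HIGH_CUTOFF_UM),
        "lc": label_activity(ic50_um, LOW_CUTOFF_UM),
    }


# Illustrative IC50 values in µM, not taken from the dataset.
for ic50 in (1.0, 5.0, 20.0):
    print(ic50, label_both(ic50))
```

Under this assumption every molecule that is Active at the 2.5 µM cut-off is also Active at the 10 µM cut-off, so the hc task has the smaller positive class.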
Motivation
ZairaChem models are large and will always be large, since ZairaChem uses an ensemble-based approach. Nonetheless, we would like to offer the opportunity to distill ZairaChem models for easier deployment, especially in online inference. We'd like to do it in an interoperable format such as ONNX.
The Olinda package
Our colleague @leoank already contributed a fantastic package named Olinda that we could, in principle, use for this purpose. Olinda takes an arbitrary model (in this case, a ZairaChem model) and produces a much simpler model, stored in ONNX format. Olinda uses a reference library to do the teacher/student training and is nicely coupled with other tools that @leoank developed, such as ChemXOR for privacy-preserving AI and Ersilia Compound Embedding, which provides dense 1024-dimensional embeddings.
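To make the teacher/student idea concrete, here is a toy sketch in plain Python. It is not Olinda's API or implementation. The teacher is a hypothetical fixed function standing in for a ZairaChem ensemble, the reference library is random 3-dimensional vectors standing in for compound embeddings, and the student is a small logistic regression fitted to the teacher's soft predictions. The ONNX export step is left out, since it needs third-party packages.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def teacher_predict(x):
    """Hypothetical stand-in for an expensive teacher model."""
    return sigmoid(2.0 * x[0] - 1.5 * x[1] + 0.5 * x[2])


def student_predict(w, b, x):
    """Student: logistic regression on the same features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_student(library, soft_labels, epochs=200, lr=0.5):
    """Fit the student to the teacher's soft labels.

    Uses full-batch gradient descent on the cross-entropy between
    teacher probabilities and student probabilities.
    """
    dim = len(library[0])
    w, b = [0.0] * dim, 0.0
    n = len(library)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(library, soft_labels):
            err = student_predict(w, b, x) - t  # d(loss)/d(logit)
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Reference library: unlabeled inputs that only the teacher scores.
library = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
soft_labels = [teacher_predict(x) for x in library]

w, b = train_student(library, soft_labels)

# Check how closely the student tracks the teacher on unseen inputs.
held_out = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
mae = sum(
    abs(student_predict(w, b, x) - teacher_predict(x)) for x in held_out
) / len(held_out)
print(f"mean absolute teacher/student gap: {mae:.3f}")
```

The point of the reference library is that no experimental labels are needed: the teacher labels it, so the library can be made as large and diverse as necessary for the student to cover the chemical space of interest. In this toy case the student has the same functional form as the teacher, so it can match it almost exactly; a real student distilled from an ensemble generally cannot.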
Roadmap