
[Proposal] Retrofitting the standard for an open AI audience #58

Closed
nathanbaleeta opened this issue Mar 2, 2021 · 4 comments

nathanbaleeta commented Mar 2, 2021

The current standard addresses the needs of indicator 6 (Mechanism for Extracting Data) better for a software context than for AI, and there is still room to be more explicit. This issue seeks to outline key questions of concern regarding open AI models and data extraction mechanisms for non-personally identifiable information:

  1. In the context of an open AI model, what qualifies as non-personally identifiable information?
     • Model weights & parameters, etc.
  2. Describe the mechanism for extracting or importing non-personally identifiable information from the system in a non-proprietary format. (The answers below are my thoughts on what possible answers this question would attract in an AI context.) Model persistence or serialization can occur through:
     • For scikit-learn, saving the model using Pickle (standard Python objects) or Joblib (efficient serialization of Python objects with NumPy arrays).
     • For Keras and TensorFlow, saving the model in HDF5 format with the .h5 extension.
     • For PyTorch, conventional approaches include Pickle, using either a .pt or .pth file extension.

PS: While the current wording in indicator 6 suffices for software, for AI models including keywords such as model persistence/serialization would make it clearer (a minimal sketch of the serialization options follows below).
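To make the options above concrete, here is a minimal sketch, assuming a hypothetical trained scikit-learn estimator `clf`, Keras model `keras_model`, and PyTorch module `torch_model` already exist in scope:

```python
import pickle            # standard-library serialization
import joblib            # efficient for objects holding large NumPy arrays
import torch             # PyTorch serialization utilities

# scikit-learn: Pickle or Joblib (the names clf, keras_model, torch_model are hypothetical)
joblib.dump(clf, "model.joblib")
clf_restored = joblib.load("model.joblib")

with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)

# Keras / TensorFlow: HDF5 format with the .h5 extension
keras_model.save("model.h5")

# PyTorch: Pickle-based saving, conventionally .pt or .pth
torch.save(torch_model.state_dict(), "model.pt")
torch_model.load_state_dict(torch.load("model.pt"))
```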

nathanbaleeta changed the title from "[Proposal Retrofitting the standard for an open AI audience" to "[Proposal] Retrofitting the standard for an open AI audience" on Mar 4, 2021

Lucyeoh commented Apr 15, 2021

Status & Next Steps:

  • Engage an AI expert to review the current standard and this proposal (specifically regarding non-PII data & data privacy).


Lucyeoh commented Apr 15, 2021

Prioritization: should come after #59


prajectory commented May 12, 2022

We have expert input from Lea Gimpel on how the DPG Standard can be better retrofitted for open AI digital solutions.

Reproducibility:
This means that all training details are given. Needless to say, this includes a description of the data, code documentation and tech stack documentation (these can follow the already existing standards and criteria). We think it should also include specific model-training documentation, for instance what kind of CPU/GPU, OS and platforms (cloud provider, Google Colab, etc.) were used for the training, along with a list of all training parameters. Ideally, a tech-savvy person should be able to re-train the model with identical evaluation scores, given all information, data and computing power.
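As a rough sketch of such model-training documentation (not part of the standard; the model and parameters here are only illustrative), a training script could dump all parameters plus platform details next to the trained model:

```python
import json
import platform
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
params = {"C": 1.0, "max_iter": 200, "random_state": 42}
model = LogisticRegression(**params).fit(X, y)

# Record everything a tech-savvy person would need to attempt an identical re-run
run_info = {
    "model_class": "sklearn.linear_model.LogisticRegression",
    "training_parameters": params,
    "library_versions": {"scikit-learn": sklearn.__version__},
    "platform": platform.platform(),          # OS / architecture used for training
    "python_version": platform.python_version(),
    "train_score": model.score(X, y),
}
with open("training_run.json", "w") as f:
    json.dump(run_info, f, indent=2)
```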

A nice way to think about transparent documentation of AI models is Google’s idea of “model cards” (see the model cards site and the corresponding article; in addition, Timnit Gebru has suggested “datasheets for datasets”, which could be an interesting tool for the discussion around open data as a DPG).

Accessibility:
This is quite critical, in our opinion. The model should be easily accessible and usable. A good solution may be the provision of an API, so that you can send a request and retrieve the prediction outcome through a stable connection in real time. Platforms such as Hugging Face are also quite handy here, since they allow one-liner access to and usage of trained ML models. (They also recently raised funding at a roughly $2B valuation, aiming to build the GitHub of machine learning.)
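For illustration, a model published to the Hugging Face hub can be loaded and queried with a one-liner pipeline (this uses the transformers library's default sentiment model; the input text and printed output are only illustrative):

```python
from transformers import pipeline

# One-liner access to a hosted model; weights are downloaded on first use
classifier = pipeline("sentiment-analysis")
print(classifier("Digital public goods make AI solutions accessible to everyone."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```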

Interpretability:
We think it is essential that the prediction outcomes of the models are interpretable and understandable, at least through proper documentation and explanation. For traditional ML models, predictions should be accompanied by some sort of intuitive confidence score: it makes a difference whether a model predicts with 99% confidence or 51% confidence. If such thresholds are set, they need to be clearly stated and explained.
Generally, it should be clear what problem the AI model aims to solve and what realistic outcomes/performance the user can expect.
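A minimal sketch of exposing such a confidence score, using scikit-learn's predict_proba; the 0.9 threshold is a hypothetical value chosen only to show why thresholds must be stated and explained:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

probabilities = model.predict_proba(X[:1])[0]   # class probabilities for one sample
confidence = probabilities.max()

# A documented, explained threshold: predictions below it are flagged as uncertain
THRESHOLD = 0.9  # hypothetical value; it must be stated and justified in the docs
label = probabilities.argmax() if confidence >= THRESHOLD else "uncertain"
print(f"prediction: {label}, confidence: {confidence:.2f}")
```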

Independence:
This adds to the point of accessibility. We may also make the model accessible as some sort of package, i.e. a collection of modules that can be downloaded and used in a programming language such as Python (pip install our_packaged_model), or as a sub-module in an existing package (this is how it would work if pushed to the Hugging Face model hub). The point of independence is that the dependencies need to follow the same standards as the end product, but that is also already outlined in the standard.
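A hedged sketch of both distribution routes from the user's perspective; our_packaged_model is the hypothetical package name from the comment above, and org/model-name is a placeholder hub id, not a real repository:

```python
# Route 1: a standalone pip package (hypothetical; shown as comments only)
#   pip install our_packaged_model
#   from our_packaged_model import load_model
#   model = load_model()
#   model.predict(["some input"])

# Route 2: a sub-module-style workflow via the Hugging Face model hub
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("org/model-name")   # placeholder id
model = AutoModel.from_pretrained("org/model-name")
```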

prajectory commented

We will resolve this on #130, the latest issue on the topic of AI as a part of the standard.

Projects
Status: Merge/Duplicate