While OpenML invests a lot of effort into providing high-quality datasets and benchmarks, this effort is not always visible. This is an initial summary of a short brainstorm on how to improve the way OpenML is perceived in the community, especially compared to other platforms. The goal of this thread is to discuss these ideas further and decide on concrete actions. We would be very happy to help anyone in the community pick up specific ideas to work on, and we warmly welcome any other ideas or feedback.
Website
Many other platforms offer a number of nice features that make them more accessible:
Trending datasets (based on novelty and likes), an inviting way for people to discover interesting new datasets. OpenML could easily add something like this to the website.
Popular topics: OpenML could also add this. We are currently working on learning the topics automatically, but we should also show them more clearly on the website.
Bring back liking of datasets, tasks, etc. Currently, this is not implemented on the new website.
Dataset explorer (a quick view of the data table; should be easy to build in Dash)
Code snippets on how to load datasets and reproduce runs
Data quality
OpenML datasets are generally high quality but not always perceived as such. Ways to improve this:
Do a big push, with help from students, in improving the dataset descriptions and meta-data
Have a 'human verified' badge for datasets that were checked by a human.
Have a clear (human-verified) synthetic / real badge (not just a tag)
Add features on the website that allow users to improve data quality, e.g. improve descriptions or fix metadata. Reward users for doing this, e.g. with a reputation/activity score as on Stack Overflow.
Create data quality bots that automatically create a data quality report for datasets and post that on the website.
Dataset discussion boards on the website
Make an official 'Editor' role: people whom we trust to change datasets and who have the access rights to do so.
Add 'datasheets' with structured metadata
Add a 'Dataset review' feature: structured reviews of datasets
List the original dataset paper, and/or the papers in which the dataset is used.
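As a sketch of what such a data quality bot could produce, the checks and report format below are purely illustrative assumptions; a real bot would pull the data via the OpenML API and post the report to the dataset page:

```python
import pandas as pd

def quality_report(df: pd.DataFrame, name: str) -> str:
    """Render a small markdown quality report for one dataset."""
    lines = [f"# Data quality report: {name}", ""]
    lines.append(f"- rows: {len(df)}, columns: {df.shape[1]}")

    # Columns with missing values, with counts
    missing = df.isna().sum()
    with_missing = missing[missing > 0]
    if with_missing.empty:
        lines.append("- missing values: none")
    else:
        detail = ", ".join(f"{c} ({n})" for c, n in with_missing.items())
        lines.append(f"- missing values in {len(with_missing)} column(s): {detail}")

    # Constant columns carry no signal and often indicate an upload problem
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    lines.append(f"- constant columns: {constant if constant else 'none'}")

    # Duplicate rows can leak between train and test splits
    lines.append(f"- duplicate rows: {int(df.duplicated().sum())}")
    return "\n".join(lines)

# Toy example: one missing value, one constant column, one duplicated row
df = pd.DataFrame({"a": [1, 2, 2, None],
                   "b": ["x", "x", "x", "x"],
                   "c": [0.1, 0.2, 0.2, 0.3]})
print(quality_report(df, "toy"))
```

Even a handful of simple checks like these, run automatically on every upload, would surface most common quality issues before a human ever looks at the dataset.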
Community visibility
Build trust together with the community:
Add testimonials on the website from well-known machine learning people praising OpenML or highlighting use cases
Join forces with Kaggle/HuggingFace, e.g. map OpenML datasets to their equivalents on Kaggle. Link to Kaggle and ask Kaggle to link back to OpenML, e.g. to get more data or see benchmarks. Find a way to discover matching datasets.
Do an open dataset study (maybe together with Kaggle and HuggingFace) that reviews datasets with respect to quality, realism, and signal for benchmarking
Create more and better-curated OpenML benchmarking suites, and generate more excitement about them
Auto-generate notebooks and link them to OpenML datasets, e.g. running a standard dataset analysis or benchmark, so that people can more easily reproduce them.
Allow people to contribute their own notebooks, or link to Kaggle kernels
Make an overview of papers that use OpenML, and allow users to add their papers to this list. Keep it nice and organized.
Make it clearer that OpenML runs are reproducible, e.g. by generating code/notebooks that do this.