Automatically load dataset as pandas #1251

Open
mfeurer opened this issue May 17, 2023 · 0 comments
Labels
Data OpenML concept enhancement


mfeurer commented May 17, 2023

This issue proposes that we (1) load datasets as pandas DataFrames by default and (2) rewrite the dataset loader to work with pandas internally, converting to a numpy array only if the user requests one.

The reasons for this proposal are:

  1. pandas is much more stable than it was a few years ago when we started this project, and it can now also handle strings properly (see Proposal: Use pandas str type for str datasets #1107).
  2. pandas can properly encode categorical columns, which can make it easier for projects building on OpenML-Python to handle these categories.
  3. We will use parquet in the background to store files anyway, and that storage has to interface with pandas.
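
To make the "pandas by default, numpy on request" idea concrete, here is a minimal sketch using only pandas and numpy. The helper `to_requested_format` and its `as_numpy` flag are hypothetical names invented for illustration and are not part of OpenML-Python. The sketch also assumes the frame holds only numeric and categorical columns; string columns would need separate handling.

```python
import numpy as np
import pandas as pd


def to_requested_format(frame: pd.DataFrame, as_numpy: bool = False):
    """Hypothetical helper: return pandas by default, numpy only on request."""
    if not as_numpy:
        # Default path: hand the DataFrame back untouched, categories intact.
        return frame

    encoded = frame.copy()
    for name in encoded.columns:
        if isinstance(encoded[name].dtype, pd.CategoricalDtype):
            # Replace each category with its integer code. pandas marks
            # missing values with the code -1, which is mapped to NaN here.
            codes = encoded[name].cat.codes.astype("float64")
            encoded[name] = codes.where(codes >= 0, np.nan)
    # Assumption: every remaining column is numeric.
    return encoded.to_numpy(dtype="float64")


frame = pd.DataFrame(
    {
        "color": pd.Categorical(["red", "blue", "red"], categories=["blue", "red"]),
        "size": [1.0, 2.5, 3.0],
    }
)

default_result = to_requested_format(frame)
numpy_result = to_requested_format(frame, as_numpy=True)
```

Keeping the DataFrame as the internal representation means the category labels survive until the caller explicitly asks for an array, at which point they are reduced to integer codes.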

PGijsbers added the enhancement, Data, and OpenML concept labels May 17, 2023