Discontinue the use of the iris dataset #28

Open
antaldaniel opened this issue Jan 21, 2025 · 2 comments
Labels
help wanted Extra attention is needed

Comments

@antaldaniel
Contributor

Editor comments:
The iris dataset is very well-known, but it is also infamous because of its eugenics links.
Since having a good example dataset is very important, would you consider replacing it with another one, like maybe the palmerpenguins one, even if it comes at the cost of adding a (possibly optional) dependency?

@antaldaniel antaldaniel added the help wanted Extra attention is needed label Jan 21, 2025
@antaldaniel
Contributor Author

Frankly, @maelle, this is something that is not necessarily going to be fully resolved. I think that this issue has many consequences.

  • iris is the scientific dataset that comes with base R, and the objection should be communicated to the base R team. I think that at least in the basic README a relevant dataset that is available on all R installations is needed.
  • The other R built-in datasets all have their problems. mtcars and ToothGrowth are not well defined. USArrests includes rape arrest statistics, and I do not like that. PlantGrowth is too simple in structure.
  • If somebody digs up a bit more about mtcars, I am open to replacing iris with it in the README.
  • I am open to removing the iris dataset from the vignettes.
  • I am open to removing the iris dataset from the tests, if there are volunteers to do it. It would require rewriting more than 100 unit tests; I can commit to doing this gradually, but it would be an unjustified burden on the author.

On a less procedural note, I think that most R users do not associate the iris flowers dataset with eugenics. Before using it extensively in the package, I read a great deal about the history of the dataset, and this did not even come up. I would really like to balance the sensitivities of people who may have such associations with those of people who object to renaming things and censoring scientific history. The iris dataset has been part of R perhaps since the beginning, and there are about 2 million R users who are familiar with it. For most of these people, the iris dataset is associated with open data, open-source programming, and statistics.

I am open to this suggestion, but rewriting more than a hundred unit tests makes this a low priority. In the README, I can only imagine using a base R dataset. I will gradually remove iris from the vignettes and replace it with the penguins dataset or something similar.

@maelle
Contributor

maelle commented Jan 21, 2025

Fair enough!

Having something other than iris in user-facing docs would be great, thank you for considering it.

2 participants