Rework Non i.i.d. data notebook #786

Open
ArturoAmorQ opened this issue Dec 10, 2024 · 1 comment

Comments

@ArturoAmorQ
Collaborator

The plot showcasing the use of a ShuffleSplit strategy in the Non i.i.d. data notebook changed with a recent version of the pandas plotting utility:

In previous versions:
shuffle_1

Now:
shuffle_2

We could use the opportunity to rework this whole notebook, as mentioned in #784 (review):

  • Do not use groups with TimeSeriesSplit (it currently raises a UserWarning)
  • Use a more realistic dataset (optional)
  • Interpret the resulting R2 and MSE scores:
    • are the results above or below chance level?
    • they are over-optimistic when not evaluated properly
    • predictions are not realistic: with the current dataset, a simple DecisionTreeRegressor can foresee a sudden drop in quotes
  • Mention the actual good practices for modeling, e.g. aligning the test size of TimeSeriesSplit with the forecasting task
  • In general, give more focus to the use of TimeSeriesSplit
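
To make the points above concrete, here is a minimal sketch of what the reworked comparison could look like. It is not the notebook's code: the random-walk data, the `horizon` value and the variable names are assumptions for illustration only. It sets the `test_size` of `TimeSeriesSplit` to an assumed forecasting horizon and compares a `DecisionTreeRegressor` against a `DummyRegressor` as a chance-level reference.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import ShuffleSplit, TimeSeriesSplit, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a random walk, with the time step as the only feature.
rng = np.random.default_rng(0)
n_samples = 500
X = np.arange(n_samples).reshape(-1, 1)
y = np.cumsum(rng.normal(size=n_samples))

horizon = 20  # assumed forecasting horizon, in number of samples

# Each test fold has exactly `horizon` samples and comes after its training fold.
ts_cv = TimeSeriesSplit(n_splits=5, test_size=horizon)
# Same test size, but test samples are drawn at random across the whole series.
shuffle_cv = ShuffleSplit(n_splits=5, test_size=horizon, random_state=0)

models = {
    "tree": DecisionTreeRegressor(random_state=0),
    "dummy": DummyRegressor(strategy="mean"),  # chance-level reference
}
cvs = {"ShuffleSplit": shuffle_cv, "TimeSeriesSplit": ts_cv}

scores = {}
for cv_name, cv in cvs.items():
    for model_name, model in models.items():
        r2 = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
        scores[cv_name, model_name] = r2
        print(f"{cv_name:>15} | {model_name:>5} | mean R2 = {r2:.2f}")
```

With shuffled folds the tree can interpolate between neighbouring time steps, so its score tends to look much better than what it would achieve on a real forecasting task. With `TimeSeriesSplit`, every test sample lies beyond the training range, so the tree predicts a single constant value and its R2 cannot exceed zero.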
@glemaitre
Collaborator

I wanted to report this issue because I just saw the plot. A hot fix for the plot is to make sure that the data points are ordered by increasing date.
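
A minimal sketch of that hot fix, assuming the plotted data is a date-indexed pandas object (the series below is a made-up stand-in, not the notebook's dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit

# Hypothetical stand-in for the notebook's quotes: a date-indexed series.
dates = pd.date_range("2020-01-01", periods=100, freq="D")
target = pd.Series(np.linspace(0.0, 1.0, num=100), index=dates, name="quote")

cv = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(cv.split(target))

# ShuffleSplit yields indices in shuffled order, so the selected rows
# are generally not in chronological order.
target_test = target.iloc[test_idx]

# Hot fix: order the samples by increasing date before plotting.
target_test_sorted = target_test.sort_index()

# target_test_sorted.plot() would then draw the points in chronological
# order (plotting requires matplotlib, which is not imported here).
```

Sorting only affects the order in which the points are drawn; the set of train and test samples is unchanged.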
