Using MetPy to split up testing/training/validation xarray datasets for Machine Learning #3579

Open
ThomasMGeo opened this issue Jul 22, 2024 · 6 comments
Labels
Type: Feature New functionality

Comments

@ThomasMGeo

What should we add?

Creating testing/training/validation datasets is a key step in machine learning workflows. Usually for Climate/Weather ML analysis, we split these datasets on a time dimension.

Scikit-learn has a function for this, train_test_split, which works on 2D arrays and pandas DataFrames. It can't split xarray datasets.

Improvements on the scikit-learn implementation:

  1. Built for xarray datasets
  2. Can create a validation dataset (a third dataset) in a single call, instead of calling the splitter twice
  3. Can split datasets up in a useful way for time series analysis (do not split up datasets randomly for time series analysis!)
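
A minimal sketch of what such a splitter could look like, assuming contiguous chronological blocks are wanted. The function name `split_by_time` and its parameters are hypothetical and are not taken from MetPy, scikit-learn, or the linked gist.

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_time(ds, train_frac=0.7, val_frac=0.15, time_dim="time"):
    """Split ds into contiguous train/validation/test blocks along time_dim.

    Hypothetical helper: the test set is whatever remains after the
    training and validation fractions have been taken.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    ds = ds.sortby(time_dim)  # make sure the blocks are chronological
    n = ds.sizes[time_dim]
    n_train = int(round(n * train_frac))
    n_val = int(round(n * val_frac))
    train = ds.isel({time_dim: slice(0, n_train)})
    val = ds.isel({time_dim: slice(n_train, n_train + n_val)})
    test = ds.isel({time_dim: slice(n_train + n_val, None)})
    return train, val, test


# Toy dataset: 100 daily time steps on a 2 x 3 grid.
times = pd.date_range("2000-01-01", periods=100, freq="D")
ds = xr.Dataset(
    {"t2m": (("time", "y", "x"), np.zeros((100, 2, 3)))},
    coords={"time": times},
)
train, val, test = split_by_time(ds)
```

Because the blocks are taken in order and never shuffled, every validation time step is later than every training time step, and every test time step is later than both.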

Big questions:

  1. Where should this go?
  2. Can we use MetPy's xarray accessor (ds.metpy.parse_cf()) in a smart way to pull the time dimension automagically? This might not be required anyway.

Reference

No response

@ThomasMGeo ThomasMGeo added the Type: Feature New functionality label Jul 22, 2024
@ThomasMGeo
Author

@anacmontoya and I have been working on a function/notebook that might be a good starting point for this work.

@anacmontoya
Contributor

https://gist.github.com/anacmontoya/35156d81fec1fe790b67916d2339d793

Here's the code!

@sethmcg

sethmcg commented Jul 24, 2024

  1. As a completely naive user, I would expect to find this functionality in the Xarray integration section.

  2. That seems like it should be easy. If the data is CF-compliant, you can look first for the coordinate with an axis attribute of "T", then for one with standard_name "time".
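
The lookup order described above can be sketched in plain xarray. The helper name `find_time_coord` is hypothetical, and the sketch assumes the CF attributes are still present in each coordinate's `attrs` after the file has been opened.

```python
import numpy as np
import xarray as xr


def find_time_coord(ds):
    """Return the name of the CF time coordinate in ds, or None if absent."""
    # First preference: a coordinate whose axis attribute is "T".
    for name, coord in ds.coords.items():
        if coord.attrs.get("axis") == "T":
            return name
    # Fallback: a coordinate whose standard_name is "time".
    for name, coord in ds.coords.items():
        if coord.attrs.get("standard_name") == "time":
            return name
    return None


# Toy dataset whose time coordinate is named "t" and only carries standard_name.
ds = xr.Dataset(
    {"pr": (("t",), np.arange(4.0))},
    coords={"t": ("t", np.arange(4), {"standard_name": "time"})},
)
time_name = find_time_coord(ds)
```

Here `time_name` is `"t"`, so a splitter could accept data whose time dimension is not literally called `time`.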

@sethmcg

sethmcg commented Jul 24, 2024

As for what to add, a few things jump out at me from a climate modeling perspective:

  • For the sake of later analysis, we usually split things based on dates, not proportions. So I'd like the ability to specify the splits as, e.g., training = 1980-2012, validate = 2013-2017, test = 2018-2022.
  • Or 1979-10-01 through 2011-09-30, etc. if you're using water years. So while sometimes you need to specify a full datetime for the split point, it would be nice to be able to just give the year (or year+month) and have it automatically promote from year to year+month to date to date+time as needed.
  • Climate models often use non-standard calendars, so datetimes should be handled as cftime objects, rather than np.datetime64 objects.
  • Another possibility is that sometimes you want to do things like train on even years and validate on odd years, and then hold out some other subset for testing, like a chunk at the end, or maybe years divisible by 5.
  • Or you might want to split randomly by year, but ensure that each split has a decent sampling of the different ENSO phases. (I.e., conditioning the splits based on some external factor.)

I don't think MetPy needs to fully support all of these, but it would be good to have a way of specifying the splits that could accommodate them.
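
As a rough illustration of date-based splits, the sketch below leans on xarray's label-based `.sel` with slices. Partial datetime strings such as `"1980"` or `"1979-10"` expand to the whole year or month, and both slice endpoints are inclusive. The helper name `split_by_dates` is hypothetical. xarray documents the same partial-string indexing for cftime-based indexes, but this sketch only exercises a standard-calendar time axis.

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_dates(ds, periods, time_dim="time"):
    """Return {split name: subset of ds} for inclusive (start, end) labels."""
    return {
        name: ds.sel({time_dim: slice(start, end)})
        for name, (start, end) in periods.items()
    }


# Toy monthly dataset covering 1980-2022.
times = pd.date_range("1980-01-01", "2022-12-01", freq="MS")
ds = xr.Dataset(
    {"tas": ("time", np.arange(times.size, dtype=float))},
    coords={"time": times},
)

splits = split_by_dates(ds, {
    "train": ("1980", "2012"),
    "validate": ("2013", "2017"),
    "test": ("2018", "2022"),
})
```

Full dates work the same way, so a water-year boundary could be written as `("1979-10-01", "2011-09-30")`.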

@ThomasMGeo
Author

All great points!

When this function was first written, I was mainly trying to match the interface and output of scikit-learn's train_test_split, but for xarray.

I think most of your requests are straightforward enough using .isel (you could argue the same for the proposed function :) ).

I do like the idea of adding even/odd year, or more advanced sampling that is not as easily done.
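
One way the even/odd-year idea might look, as a sketch only: train on even years, validate on odd years, and hold out a chunk at the end for testing. The function name `split_even_odd_years` and the `holdout_start_year` parameter are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_even_odd_years(ds, holdout_start_year, time_dim="time"):
    """Train on even years, validate on odd years, test on the held-out tail."""
    years = ds[time_dim].dt.year.values
    holdout = years >= holdout_start_year
    # np.flatnonzero turns each boolean mask into integer positions for isel.
    train = ds.isel({time_dim: np.flatnonzero(~holdout & (years % 2 == 0))})
    val = ds.isel({time_dim: np.flatnonzero(~holdout & (years % 2 == 1))})
    test = ds.isel({time_dim: np.flatnonzero(holdout)})
    return train, val, test


# Toy monthly dataset covering 2000-2019.
times = pd.date_range("2000-01-01", periods=240, freq="MS")
ds = xr.Dataset(
    {"tas": ("time", np.arange(240.0))},
    coords={"time": times},
)
train, val, test = split_even_odd_years(ds, holdout_start_year=2016)
```

The same masking pattern would extend to other rules, such as holding out years divisible by 5, by swapping the boolean expressions.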

@sethmcg

sethmcg commented Jul 24, 2024

I think a lot of it could be handled by just allowing the user to specify a list of elements for each split instead of the boundaries. What would be really keen is if those lists could contain just years, instead of the full set of datetimes within each year.

The next step beyond that would then be to allow the user to change the date when the year begins/ends, so that you could use water years or winters or whatever depending on what you're studying...
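
A sketch of how lists of years and a movable start of year could fit together. The names `label_years` and `split_by_year_lists` are hypothetical. The labelling convention assumed here is the one commonly used for water years, where a year that starts in October is named after the calendar year in which it ends.

```python
import numpy as np
import pandas as pd
import xarray as xr


def label_years(ds, start_month=1, time_dim="time"):
    """Label each time step with the year it belongs to.

    With start_month=10, October 2001 through September 2002 is year 2002.
    """
    years = ds[time_dim].dt.year.values
    months = ds[time_dim].dt.month.values
    if start_month == 1:
        return years  # plain calendar years
    return np.where(months >= start_month, years + 1, years)


def split_by_year_lists(ds, year_lists, start_month=1, time_dim="time"):
    """Return {split name: subset of ds} given a list of years per split."""
    labels = label_years(ds, start_month, time_dim)
    return {
        name: ds.isel({time_dim: np.flatnonzero(np.isin(labels, list(years)))})
        for name, years in year_lists.items()
    }


# Toy monthly dataset covering water years 2001-2004 (Oct 2000 to Sep 2004).
times = pd.date_range("2000-10-01", periods=48, freq="MS")
ds = xr.Dataset(
    {"q": ("time", np.arange(48.0))},
    coords={"time": times},
)
splits = split_by_year_lists(
    ds,
    {"train": [2001, 2003], "validate": [2002], "test": [2004]},
    start_month=10,
)
```

Since the user only supplies year labels, the full set of datetimes within each year never has to be listed by hand.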


3 participants