
Converting notebooks from COSIMA Cookbook to ACCESS-NRI intake catalog #313

Open · 9 of 29 tasks
navidcy opened this issue Jan 8, 2024 · 63 comments
Labels: 🕹️ hackathon 4.0, 🛸 updating (An existing notebook needs to be updated)

Comments

@navidcy
Collaborator

navidcy commented Jan 8, 2024

Why do we need two ways for people to load output? At the moment, after #298, we have tutorials for both the cookbook and the ACCESS-NRI Intake catalog:

https://cosima-recipes.readthedocs.io/en/latest/tutorials.html

Let's just keep the better of the two. Is the ACCESS-NRI Intake catalog the future? Should we make it the default in all examples and drop the cookbook then?

cc @angus-g, @aidanheerdegen, @AndyHoggANU, @aekiss, @adele-morrison, @anton-seaice, @PaulSpence, @rmholmes, @edoddridge, @micaeljtoliveira

Examples

@navidcy added the ❓ question (Further information is requested) label on Jan 8, 2024
@navidcy changed the title from "Cookbook _or_ ACCESS-NRI Intake Catalog" to "Cookbook OR ACCESS-NRI Intake Catalog" on Jan 8, 2024
@navidcy
Collaborator Author

navidcy commented Jan 8, 2024

#310 seems relevant

@aidanheerdegen
Contributor

Largely this is up to the COSIMA community, but it is likely a choice you will be forced to make for a few reasons:

  1. The Cookbook Database does not scale well, particularly in the time taken to index data
  2. The infrastructure which runs the indexing will likely cease to exist by the end of the first quarter of 2024 (the accessdev Virtual Machine is scheduled to be shuttered, and it is what provides the outward-facing web address for the Jenkins service)
  3. There are no resources to do more than just maintain the Cookbook DB (@micaeljtoliveira does what he can, but he has many other high priority tasks)
  4. ACCESS-NRI has chosen to support the Intake Catalogue (in part for some of the reasons above)

So I'd suggest taking this to a COSIMA meeting and making the community aware that a decision needs to be made, and probably the sooner the better.

One downside of dropping the Cookbook is that there is currently no GUI for searching the Intake catalogue. However, there are some powerful methods for searching (filtering) the intake catalogue:

https://access-nri-intake-catalog.readthedocs.io/en/latest/usage/quickstart.html#data-discovery
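For reference, a minimal search along those lines could look something like the following (a sketch only, based on the quickstart linked above; the model and variable values are just examples):

import intake

# open the ACCESS-NRI catalogue (entry point as described in the quickstart above)
catalog = intake.cat.access_nri

# filter the catalogue down to datastores containing a given variable
results = catalog.search(model="ACCESS-OM2", variable="temp")
print(results)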

Also there are plans for some tools to assist with data discovery from ACCESS-NRI, and we're working on prototypes we hope to share soon.

@navidcy
Collaborator Author

navidcy commented Jan 8, 2024

Thanks @aidanheerdegen!
I'm happy to hear voices from others here too and then take it to a COSIMA meeting -- sounds good!

@adele-morrison
Collaborator

The advice we had from ACCESS-NRI on this back around the workshop last year was that, at that time, they didn’t advise switching everything to the intake catalogue, because it was so new and untested. But @aidanheerdegen @dougiesquire, if you think the intake catalogue is ready for us to switch everything over completely, I think we should do that. We discussed this previously at a COSIMA meeting and agreed we should switch when the time was right (which is maybe now?).

@anton-seaice
Collaborator

Not much to add from me:

  • There hasn't been much uptake of the intake catalogue, so there may still be bugs/missing datasets etc that will need attention.
  • We should keep the cookbook going in its current state (i.e. without adding new data or updates) for existing users, and keep the cookbook tutorials available for some time for people who are halfway through projects. This could be as simple as a tag on the repository, a fork/branch of the repo, or a separate folder in the main branch (depending on whether we want the tutorials on the Read the Docs version or are happy with them only on GitHub).
  • It would be nice to have the test pipeline (which also relies on accessdev) replaced prior to the switch, which could make reviewing notebooks more robust too.

@dougiesquire
Collaborator

I think @max-anu has been modifying the existing recipes to use the catalog. It would be good to get an update from him. There are definitely datasets that are in the cookbook but not yet in the catalog. I think @max-anu will probably have the best idea of what these are.

@aidanheerdegen
Contributor

@anton-seaice makes a good point about retaining existing capability for those who rely on it, identifying deficiencies and setting up a new testing infrastructure.

@dougiesquire is also correct that datasets that are missing from intake should be identified.

@adele-morrison I'm not in a position to say if the intake catalogue is ready or not, but we're rapidly approaching a point where there isn't really an option, so we need to make sure it is.

So, synthesising from above, some steps to consider:

  • Decide on deprecation plan for cookbook DB (timing to stop ingesting new data and freeze DB, where/how to archive CC DB recipes for medium/long term support)
  • Set up testing infrastructure for intake recipes
  • Document datasets missing from intake
  • Decide which datasets need to be added, particularly if there are any that must be added before swapping to intake
  • Set proposed time-frame for replacing CC DB recipes with intake versions
  • Communicate plan to community, and get users testing out intake catalogue to find weak points
  • Deprecate CC DB and replace with intake recipes in main and docs

Might be good to break this out into a GitHub project, make some issues etc. I would do this, but I'm not in a position to assist much I'm afraid.

@Thomas-Moore-Creative
Collaborator

Could the culture shift going forward so that, when someone or a group runs an experiment that is meant to be open to the community, creating an Intake-ESM datastore is part of the process of making it available and documented? My view is that the "authors" of specific experiments are best placed to create and update datastores for the catalog.

@dougiesquire appears to have done a great job documenting how that can be done for those who've never built a catalog before: I want to catalog my own data to make it easy to find and load
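For anyone curious, building a datastore along those lines might look roughly like this (a sketch only, loosely based on the access-nri-intake docs linked above; the builder class usage, experiment path and names here are illustrative assumptions, not a recipe for a specific experiment):

from access_nri_intake.source.builders import AccessOm2Builder

# hypothetical path to the experiment output to be catalogued
builder = AccessOm2Builder(path="/g/data/xx00/my_experiment/output")

# scan the output files and assemble the Intake-ESM datastore
builder.build()

# write the datastore (a json/csv pair) so it can later be added to the catalogue
builder.save(
    name="my_experiment",
    description="Example description of the experiment and its configuration",
)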

This new catalog is a big step forward, IMO, but it will require a community effort to maintain.

@rbeucher
Contributor

Hi All,

So from the MED team perspective, I asked @max-anu to look at converting the recipes to use the intake catalog.
The main reason for going ahead with this was precisely to identify the gaps that are mentioned above (missing functionalities, datasets etc.) and to push for a broader use of the intake-catalog. @max-anu should be able to give us an update.

#310 is supposed to be a draft and everyone is welcome to participate and/or comment on it.

Happy to discuss the plan going ahead.

@rbeucher
Contributor

I have started a repo on the ACCESS-NRI organisation to automate testing of the recipes:
https://github.com/ACCESS-NRI/COSIMA-recipes-workflow

This is based on what I have set up for the ESMValTool recipes:

https://github.com/ACCESS-NRI/ESMValTool-workflow?tab=readme-ov-file#recipes-current-status

Will keep you posted.

@rbeucher changed the title from "Cookbook OR ACCESS-NRI Intake Catalog" to "Converting notebooks from COSIMA Cookbook to ACCESS-NRI intake catalog" on Jan 29, 2024
@access-hive-bot

This issue has been mentioned on the ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/workshop-on-community-driven-practices-for-analysing-coding-and-sharing-cosima-hackathon-v4-0/2047/6

@dougiesquire
Collaborator

Will keep you posted.

@rbeucher do you have any updates? It would be great to get an update on where @max-anu has got to so we can try to progress this in the upcoming hackathon.

@rbeucher
Contributor

rbeucher commented May 1, 2024

@max-anu may have some updates.

@AndyHoggANU
Contributor

I've had a quick look at this and it seems that many of the changes @max-anu made are in the max-updating-notebooks branch. So, to a large degree this task depends upon taking up what has already been done. It would be good to do that in a way that retains Max's contributions. Does anybody know the best way to do that?

My only suggestion is for others to check the existing branch and then to merge?

@rbeucher
Contributor

rbeucher commented Jun 5, 2024

It's possible to split @max-anu's work into multiple PRs if that helps for the workshop. We can do one per notebook.

@AndyHoggANU
Contributor

Yeah, I think that would be perfect!

@navidcy
Collaborator Author

navidcy commented Jun 6, 2024

Agree!!

@adele-morrison
Collaborator

Maybe this could be good practice for the group doing PR review training to hone their skills on during the hackathon?

@edoddridge
Collaborator

It's possible to split @max-anu's work into multiple PRs if that helps for the workshop. We can do one per notebook.

@rbeucher - you are much braver at messing about with git than I am. Are you envisioning cloning the branch multiple times and then resetting all but one of the files on each branch? 😬

@navidcy
Collaborator Author

navidcy commented Jun 6, 2024

Maybe this could be good practice for the group doing PR review training to hone their skills on during the hackathon?

Sure. But a bunch of smaller PRs would be even better, and more manageable in general. Even a “PR review master” might have trouble dealing with a PR for which they need to run all the notebooks.

@adele-morrison
Collaborator

Oh yes, that's what I meant: a bunch of small PRs, and then each person in the training could review one.

@rbeucher
Contributor

rbeucher commented Jun 6, 2024

I'll have a look today or tomorrow.

@rbeucher
Contributor

rbeucher commented Jun 6, 2024

That's one way to do it, @edoddridge. I want to keep @max-anu's contributions. I'll have a look and report back to you.

@navidcy
Collaborator Author

navidcy commented Jun 6, 2024

I believe there is a way to keep @max-anu's commits for each file! Let's do it that way please...

@navidcy
Collaborator Author

navidcy commented Jul 7, 2024

Makes sense. Why do all the automatically-generated PRs add pandas then? I mean this could be a rhetorical question... Don't feel obliged to answer it.

(I was actually confused by it and was starting to think that I just don't know Intake well enough...)

@dougiesquire
Collaborator

dougiesquire commented Jul 7, 2024

I'm not sure sorry - @max-anu generated those. But the imports can be removed if pandas isn't explicitly used anywhere.

@rbeucher
Contributor

rbeucher commented Jul 7, 2024

Hi @navidcy,

The idea behind generating those PRs was to have a starting point and use the work from @max-anu.

I didn't get the opportunity to review them, and I chose not to rebase the code onto main.

Yes, they do need work, and cleaning.

I don't think that closing the PRs before the conversion is done is a good idea.

@Thomas-Moore-Creative
Collaborator

Thomas-Moore-Creative commented Jul 15, 2024

I don't know about all the PRs or whether this is helpful, but for #372 here is a list of the changes made: #372 (comment)

(mostly thanks to guidance above from @anton-seaice)

@julia-neme
Collaborator

I was working on switching my recipes to using intake and I have two comments:

  1. I think that not being able to add start_time and end_time as arguments when opening a variable is inefficient. Is this a feature that could be added to intake?
  2. The catalog does not have any panant experiments indexed, so it won't yet work for any of the notebooks that do things model-agnostically. Can these experiments be added too?

@anton-seaice
Collaborator

Thanks Julia! It's good people are getting stuck into it.

  1. I think that not being able to add start_time and end_time as arguments when opening a variable is inefficient. Is this a feature that could be added to intake?

@dougiesquire can explain this better than I can. There are two things that come to mind (explained more in this tutorial).

Running the .to_dask() command doesn't load the data into memory yet; it's loaded 'lazily', so only the dimensions, coordinates and metadata are available, and the actual data won't be loaded until a calculation is performed. This allows the times needed for the analysis to be selected, and once the calculation is performed only data from those times should be loaded, i.e. like this:

data_ic = catalog[experiment].search(
    variable=variable, 
    frequency="1mon"
).to_dask()

data_ic_times = data_ic.sel(time=slice(start_time, end_time))

.. do the calculation

If that to_dask step is slow, it may be due to there being many files to concatenate. You can make your instance bigger, make sure you start a dask cluster (with threads_per_worker=1) before running to_dask, and add these keywords to the .to_dask() call:

xarray_combine_by_coords_kwargs=dict(
    compat="override",
    data_vars="minimal",
    coords="minimal"
)

catalog[experiment].search(
    variable=variable, 
).to_dask(
    xarray_combine_by_coords_kwargs=xarray_combine_by_coords_kwargs,
)

These keywords reduce the number of consistency checks xarray does when concatenating the netCDF files containing the data. We don't really need the checks because we know the model data is output on a consistent grid and with the same variables in every file, etc.
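For completeness, starting the dask cluster mentioned above might look something like this (a sketch; the worker count is just an example):

from dask.distributed import Client

# one thread per worker avoids thread contention while many netCDF files
# are being opened at once; the number of workers here is just an example
client = Client(n_workers=4, threads_per_worker=1)
client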

  2. The catalog does not have any panant experiments indexed, so it won't yet work for any of the notebooks that do things model-agnostically. Can these experiments be added too?

We did start digging into that here ... if there are specific experiments you think are worth adding, can you add them to that thread please?

@navidcy
Collaborator Author

navidcy commented Jul 19, 2024

I'm hitting this issue that @julia-neme mentions above!

I need to load 500 years of data just to slice 2 months of output... seems a bit over the top?

Just adding another voice to the struggles here. Nothing more to add... :)

@navidcy
Collaborator Author

navidcy commented Jul 19, 2024

@anton-seaice I'm not sure exactly what you are suggesting (not sure if it's a bit of an esoteric dialogue with @dougiesquire, in which case I can ignore it).

But before, we could avoid loading 500 years of output with two keywords. What's the equivalent way to do it now? We need to add an example to the Intake tutorial -- sorry if this is already there and @julia-neme and I seem to have missed it...

@julia-neme
Collaborator

I've tested how much time it takes with the cookbook, intake without kwargs and intake with kwargs (https://gist.github.com/julia-neme/80a277a0f877c044b2a83a097bba744c)

The cookbook is significantly faster and the kwargs don't make much improvement :( But maybe there's a smarter way to set them up that I'm not aware of.

@navidcy
Collaborator Author

navidcy commented Jul 19, 2024

Thanks @julia-neme

Hm...........

That's a 20x slowdown.....

I'm feeling a bit uneasy about Intake now....

@adele-morrison
Collaborator

Great job testing, @julia-neme! Any way we can fix this, @dougiesquire @anton-seaice?

@dougiesquire
Collaborator

Thanks all for the feedback. Yeah, unfortunately intake-esm doesn't currently provide the ability to filter based on time ranges, so the full dataset is always opened (lazily). When the user actually only wants a small temporal subset of the data, this will add a small additional overhead.

A couple of things to note:

  • The opening of the dataset(s) is done in parallel so using more workers will speed things up.
  • Opening the data typically comprises a small fraction of the total analysis time.
  • What's actually happening in this step is that xarray is loading into memory the metadata from each of the constituent files and working out how the files can be concatenated/merged into a single dataset. Things that slow this down are:
    • lots of files to open - more workers will make things faster
    • xarray's process for determining how to concatenate/merge the files includes lots of checks to deal with edge cases - the kwargs @anton-seaice suggests bypass some of the unnecessary checks and can speed things up, particularly when the files have messy/missing/difficult metadata.

That's a 20x slowdown.....

Regarding @julia-neme's timing tests, one has to be careful when timing how long it takes to open xarray objects, since the data is cached. I couldn't reproduce Julia's results, and my guess is that the cell calling cc.querying.getvar had already been run in Julia's notebook, so the data was cached. If I run Julia's notebook fresh on a full Cascade Lake node (48 cores) I get the following timings:

  • cookbook: ~10s
  • intake catalog: ~20s

So slower, of course, but not a factor of 20x. The difference will obviously also reduce with the size of the time range needed.
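(For anyone repeating these timings, a minimal sketch of a fairer comparison is to time the open step in a fresh kernel, reusing the experiment/variable names from the earlier examples, so that metadata cached by a previous cell doesn't skew the result.)

%%time
# run in a fresh kernel/session -- metadata cached by an earlier cell will
# flatter whichever method is timed second
ds = catalog[experiment].search(variable=variable, frequency="1mon").to_dask()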

I should also point out that there is an open issue and PR with intake-esm to add the ability to filter based on time ranges. I think we'll have more resources soon to point at the ACCESS-NRI Intake catalog and this could be something we take on.

@navidcy
Collaborator Author

navidcy commented Jul 22, 2024

Thanks for this @dougiesquire.

2x slower is just fine of course!

I've seen that @julia-neme was bumping into more (related?) issues -- see #363 (comment)

@julia-neme
Collaborator

My issues are probably worsened because I am working with small or medium ARE sessions. I don't really want to open a full node that I'd only need for the loading, but not for any of the computations afterwards. Especially if I can open the data via some other, less costly way (e.g. xr.open_mfdataset, the cookbook). To me it seems a waste of resources? But maybe it's not? What do people think?

@adele-morrison
Collaborator

Generally I’m not concerned by the amount of resources we use for analysis, even if people are always running XXLarge (full node) ARE sessions. The amount of compute is still tiny compared with most of our allocations. Though it does seem overkill if the large compute is only needed for Intake.

@dougiesquire
Collaborator

Though it does seem overkill if the large compute is only needed for Intake.

It's not needed; it will just make things faster.

@AndyHoggANU
Contributor

I'm also not worried by the resource use if it's needed -- but I agree this is something we can look to improve with intake in the medium term.

@edoddridge
Collaborator

I think that level of usage is very much in the noise. To put some numbers on this, the estimated usage from ARE for an 8 hour session with XX-Large is

SU estimate
28 cpu cores + 252GB mem on normalbw queue (1.25 SUs/core/h) for 8h = 280 SUs

It's not a big deal.

@navidcy
Collaborator Author

navidcy commented Jul 23, 2024

I think you/we/everybody should go all in on using the ARE resources and save your time and effort for analyses, thinking, fun activities, etc.

@julia-neme
Collaborator

Haha, great! I might be scarred from my beginner days when I was told to get better at dask instead of opening bigger gadi_jupyter notebooks 😝 I'll up my CPU game.

@navidcy added the 🛸 updating (An existing notebook needs to be updated) label and removed the ❓ question (Further information is requested) label on Jul 29, 2024
@navidcy pinned this issue on Jul 29, 2024
@navidcy
Collaborator Author

navidcy commented Jul 30, 2024

@anton-seaice, @dougiesquire:

I'm trying to understand how to proceed here.

Personally I'm hitting big slowdown issues, much worse than the 2x reported above. @julia-neme seems to be as well, and we have an open channel of communication providing mental support to each other.

But if @julia-neme and I are hitting these, then I expect most COSIMA users will hit them as well.

I see a bunch of kwargs here proposed by @anton-seaice. Do we need those? Should we always have them? If so, we need to have them in the PRs for the intake conversion... Otherwise they are hidden in a comment in an issue.

data_ic = catalog[experiment].search(
    variable=variable, 
    frequency="1mon"
).to_dask()

data_ic_times = data_ic.sel(time=slice(start_time, end_time))

.. do the calculation

If that to_dask step is slow, it may be due to there being many files to concatenate. You can make your instance bigger, make sure you start a dask cluster (with threads_per_worker=1) before running to_dask, and add these keywords to the .to_dask() call:

xarray_combine_by_coords_kwargs=dict(
    compat="override",
    data_vars="minimal",
    coords="minimal"
)

catalog[experiment].search(
    variable=variable, 
).to_dask(
    xarray_combine_by_coords_kwargs=xarray_combine_by_coords_kwargs,
)

I'm a bit in despair. I'm very intrigued and interested in understanding the nitty-gritty details of dask and xarray, but in terms of what we suggest all COSIMAers do, I would like to give them something that works. (Also for myself, btw: most of the time I'd just like to copy-paste a line and load the data -- I don't want to be thinking about xarray internals etc.) If we should have these kwargs in, then let's have them there all the time?

At the moment I'm trying to push through several of the open PRs for the intake conversion and I'm hitting these issues. Perhaps we should pause pushing those PRs through until we understand what's happening?

@navidcy
Collaborator Author

navidcy commented Jul 30, 2024

Btw I'm hitting these issues even when I'm trying to load 1 variable... so no concatenation (or at least no concatenation as far as I understand!)

@dougiesquire
Collaborator

@navidcy could you please point me to an example of your issue(s)?

@navidcy
Collaborator Author

navidcy commented Jul 30, 2024

I'll try to do that later... Or @adele-morrison / @julia-neme feel free to do so?

#344 (comment) might be one?

@adele-morrison
Collaborator

I did some more timing tests, and I actually find the kwargs don't help at all. There is just a change in timing due to caching on the 2nd run using Intake. I'm getting that Intake is about 2x slower than the cookbook when both are run for the first time.

This one runs the kwargs case first (50s for Intake compared with 28s for cookbook):

[screenshot: notebook timing results, kwargs case run first]

This one I changed the order and did no kwargs first (52s for Intake vs 25s for cookbook).
[screenshot: notebook timing results, no-kwargs case run first]

Running Intake a second time, with either kwargs or no kwargs, is much faster than the first time.

So I don't think it will help us to use kwargs.

@anton-seaice
Collaborator

I see a bunch of kwargs here proposed by @anton-seaice. Do we need those? Should we always have them? If so, we need to have them in the PRs for the intake conversion... Otherwise they are hidden in a comment in an issue.

Apologies for the confusion about this. These kwargs help with sea-ice (cice) data but not with ocean (mom) data.

It's similar to the decode_coords=False argument which we used to set for the cookbook for cice data.

MOM only adds 1D coordinates (i.e. xt_ocean, yt_ocean and time) to its output:

[screenshot: MOM dataset with 1D coordinates]

whilst cice adds 2D coordinates (geographic lons/lats at T and U points) to its output:

[screenshot: cice dataset with 2D coordinates]

When .to_dask() is called, it concatenates all the files for that variable. As part of the concatenation, by default xarray checks that all of the coordinates in the files are consistent (i.e. that the times don't overlap but the other coordinates (lat/lon) are identical). Doing this check with the 2D arrays from cice is much slower than with the 1D arrays from mom. As we know the coordinates are consistent (they come from the same model at all times), we can tell xarray to skip the consistency checks by setting the xarray_combine_by_coords_kwargs argument, and therefore speed up loading cice data.
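(A quick way to see which situation applies -- a sketch, with a hypothetical cice variable name -- is to inspect the dimensionality of the coordinates on the lazily opened dataset:)

# 'aice' is just an example cice variable name
ds = catalog[experiment].search(variable="aice", frequency="1mon").to_dask()

# 2D lat/lon coordinates (typical of cice output) are where the consistency-check
# bypass above helps; 1D coordinates (typical of mom output) generally don't need it
print({name: coord.dims for name, coord in ds.coords.items()})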

@dougiesquire
Collaborator

dougiesquire commented Jul 30, 2024

Personally I'm hitting big slowdown issues

#344 (comment) might be one?

@navidcy, I've looked into this and replied, but I didn't encounter the big slowdown issues you refer to here. I think this one might mostly have been a case of comparing apples to oranges.
