[GA-1534] interp_hybrid_to_pressure failing on large datasets #633

Open
kafitzgerald opened this issue Jul 17, 2024 · 4 comments
Labels
bug: Something isn't working
CUPiD: Related to the NSF NCAR CUPiD project
scalability and performance: Related to scalability, performance, or benchmarking
support: Support request opened by outside user/collaborator

Comments

@kafitzgerald
Contributor

Noting this here for tracking purposes.

While some performance improvements were made in #592, we've gotten another report about problems with interp_hybrid_to_pressure. In particular, it's failing while operating on a larger dataset on Casper. I've been able to replicate the issue, and it appears to stem from a very large Dask task graph, likely a combination of our internal usage of Dask in geocat-comp and the size of the dataset. We didn't see this failure in earlier testing because the dataset was much smaller; indeed, when you subset the dataset temporally and run the function again, it executes successfully.

I've followed up with a temporary workaround, but it would be good to prioritize this. This function gets a good bit of use, and the issue is likely to come up again. I also suspect there are still a lot of performance improvements to be made here.
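
For anyone digging into this: one rough way to check whether the task graph is the problem is to count its tasks before computing anything. This is only a sketch; data, ps, hyam, hybm, and p0 below are placeholders for whatever Dask-backed dataset is being processed, not a specific file.

import numpy as np
import geocat.comp as gc

# Build the lazy result without computing it. The inputs are assumed to
# be Dask-backed xarray objects (placeholders, not a particular dataset).
out = gc.interpolation.interp_hybrid_to_pressure(
    data, ps, hyam, hybm, p0, new_levels=np.array([85000.0])
)

# Count the tasks Dask would have to schedule; graphs with millions of
# tasks are a common cause of scheduler stalls and memory blow-ups.
print(len(out.data.__dask_graph__()))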

@kafitzgerald added the bug, support, scalability and performance, and CUPiD labels on Jul 17, 2024
@sudhansu-s-rath

sudhansu-s-rath commented Aug 19, 2024

Hi, it seems the same thing is happening to me. I can't use the workaround with smaller data, and I would greatly appreciate any solution to this.

Original post: https://discourse.pangeo.io/t/code-hangs-while-saving-dataset-to-disk-using-to-netcdf/4413
I tried both NetCDF and Zarr formats to save the data, using Dask clients as well, but it's still the same issue: the code does not stop and keeps on running. I saved precipitation data with the same procedure earlier (~117 GB for each model file). I guess the problem is with the 3D pressure-level variables.
If anyone wants to reproduce the error:

I am trying to save MIROC6 historical data from gs://cmip6/CMIP6/CMIP/MIROC/MIROC6/historical/r1i1p1f1/6hrLev/ta/gn/v20191114/

After reading the above as ds1, I interpolate to get one pressure level:

import numpy as np
import geocat.comp as gc

ta = ds1.ta
ps = ds1.ps
hyam = ds1.a
hybm = ds1.b
p0 = ds1.p0

new_levels = np.array([85000])

ta_850 = gc.interpolation.interp_hybrid_to_pressure(
    ta, ps, hyam, hybm, p0,
    new_levels=new_levels, lev_dim=None, method='linear',
    extrapolate=False, variable=None, t_bot=None, phi_sfc=None,
)

Now I want to save the dataset on my local cluster:

ta_850.to_zarr('/path/miroc6.zarr', consolidated=True)

or

ta_850.to_netcdf('/path/miroc6.nc')

The code runs forever and the process never completes. Any help in solving this would be appreciated. Thank you so much for your attention and participation.

@kafitzgerald
Contributor Author

Hi @sudhansu-s-rath, thanks for the note!

It's interesting that you're seeing this with extrapolate=False as well. It does look like the time dimension on your dataset is quite large, though, and you're working with a lot of data.

Could you share what version of geocat-comp you're using and a bit more info about where you're running this?

If you're looking for a workaround, you might try working on smaller temporal subsets of the data.
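
For example (just a sketch reusing ds1, gc, and np from your snippet above; pick a slice width that fits comfortably in memory):

# Interpolate a limited time range first, e.g. roughly one year of
# 6-hourly steps, rather than the full record. The slice width here is
# only an illustration.
ta_850_subset = gc.interpolation.interp_hybrid_to_pressure(
    ds1.ta.isel(time=slice(0, 1460)),
    ds1.ps.isel(time=slice(0, 1460)),
    ds1.a, ds1.b, ds1.p0,
    new_levels=np.array([85000]),
)
ta_850_subset.to_netcdf('/path/miroc6_ta850_subset.nc')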

If you're up for some deeper troubleshooting, I'd take a look at some of the diagnostics and task graphs from Dask if you haven't already.
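
Something like this (a sketch using dask.distributed, assuming you can start a local client; the report filename is just a placeholder):

from dask.distributed import Client, performance_report

client = Client()  # the dashboard URL is available via client.dashboard_link

# Capture a profile of the whole computation (ta_850 is the interpolated
# result from your snippet) so you can see where time and memory go.
with performance_report(filename="interp_report.html"):
    ta_850.compute()  # or the .to_netcdf() / .to_zarr() call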

I hope to do some additional profiling of this function this week (it really hasn't been tested at scale) and might have some better suggestions then as well.

@sudhansu-s-rath

Hi @kafitzgerald,

Thanks for your response.

  1. The output from gc.version says it's '2024.4.0'.

  2. I am running this on my department's cluster with the following resources in my batch file:
    #!/bin/bash
    #SBATCH --job-name=cmipdload_ta
    #SBATCH -p node
    #SBATCH -n 4
    #SBATCH --time=96:00:00
    #SBATCH --mem-per-cpu=20g
    #SBATCH --output=cmipdload_ta.log
    #SBATCH --mail-user=emailXXXX@illinois.edu
    ~/miniconda3/envs/g_env/bin/python /Databatch/dloadscript_ta.py

  3. I need to compute this for the entire model period, so I can't use a small slice. I checked with a small slice of a few days, and the code works fine.

  4. Let me know how I can provide more info to help you troubleshoot this; I have been struggling with it for a long time.

  5. I tried this with the variable ta; the code completed in 20 hours and the output was ~19 GB. But when I tried the same with ua or va, the code kept running forever. Even after reaching the timeout, the output file contained mostly blank data.

  6. I tried using a Dask cluster in a Jupyter notebook too, but with no success.

@kafitzgerald
Contributor Author

kafitzgerald commented Sep 5, 2024

@sudhansu-s-rath I'm still looking into this, but wanted to get back to you with a few suggestions at least.

It appears the dataset you're working with is quite large. When processing the full dataset, the memory usage is significant (often spilling to disk in my case, which slows things down considerably) and the Dask task graph is rather large. It's possible data transfer from cloud storage is playing a role here as well. I did have some luck using dask.optimize() to help with the task graph, but it's still a large, memory-intensive workflow.
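
Roughly what I mean by that (a sketch; xarray objects implement the Dask collection interface, so they can be passed to dask.optimize() directly, and ta_850 refers to the interpolated DataArray from your snippet):

import dask

# Fuse and cull the task graph before writing, which can shrink very
# large graphs considerably.
(ta_850_opt,) = dask.optimize(ta_850)
ta_850_opt.to_netcdf('/path/miroc6.nc')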

It sounds like you're interested in running this for the full dataset/simulation, but breaking the work into processing steps and then appending to a file or writing multiple intermediate files might be a good approach here (the calculations are largely independent in time) and would avoid a lot of the memory-related issues. I think that's probably the most efficient currently available approach, especially if subsetting to a specific area or time of interest first isn't an option.
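
As a sketch of that piecewise approach (the block size, paths, and final recombination are placeholders to adapt; ds1, gc, and np are from your snippet above):

import xarray as xr

block = 1460                                   # e.g. ~1 year of 6-hourly steps
ntime = ds1.sizes["time"]
for i, start in enumerate(range(0, ntime, block)):
    sub = slice(start, min(start + block, ntime))
    piece = gc.interpolation.interp_hybrid_to_pressure(
        ds1.ta.isel(time=sub), ds1.ps.isel(time=sub),
        ds1.a, ds1.b, ds1.p0, new_levels=np.array([85000]),
    )
    piece.to_netcdf(f"/path/miroc6_ta850_{i:03d}.nc")

# The intermediate files can be stitched back together afterwards:
ta_850_full = xr.open_mfdataset("/path/miroc6_ta850_*.nc")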

You may also find some relevant tips/resources for optimization and profiling here: https://docs.xarray.dev/en/latest/user-guide/dask.html#optimization-tips

@anissa111 changed the title from "interp_hybrid_to_pressure failing on large datasets" to "[GA-1534] interp_hybrid_to_pressure failing on large datasets" on Sep 17, 2024