Reading multiple ICESat-2 ATL11 point cloud data nicely via Zarr #100

Open
3 of 6 tasks
weiji14 opened this issue Jun 18, 2020 · 3 comments
Labels
enhancement ✨ New feature or request

Comments

@weiji14
Owner

weiji14 commented Jun 18, 2020

Gathering some notes on how best to read multiple ICESat-2 ATL11 files (basically a point cloud) in a user-friendly way, with metadata preserved!

TLDR: Be able to do xr.open_mfdataset("ATL11_*.h5", engine="zarr", ...).

Inspired by the blog post "Cloud-Performant NetCDF4/HDF5 Reading with the Zarr Library". Zarr is an amazing project, and I really like the .zmetadata JSON file, which can be opened with a text editor and tells you about the data (array shapes, chunks, dtypes, attributes). The dream would be to read HDF5 files in an out-of-core manner with Zarr-like speed/abilities (through the .zmetadata pointer).
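To illustrate why a plain-text .zmetadata file is handy, here is a minimal sketch that parses a consolidated metadata document with only the standard library. It assumes the Zarr v2 consolidated layout (a top-level "metadata" mapping keyed by paths ending in .zarray / .zattrs). The `h_corr` variable, its shape, and its attributes are made up for illustration, and `summarize_zmetadata` is a hypothetical helper, not part of Zarr.

```python
import json

# Hand-written stand-in for a .zmetadata file (Zarr v2 consolidated layout).
# The array name, shape, chunks and attributes are illustrative assumptions.
ZMETADATA = """
{
  "zarr_consolidated_format": 1,
  "metadata": {
    ".zgroup": {"zarr_format": 2},
    "h_corr/.zarray": {
      "shape": [1000, 6],
      "chunks": [500, 6],
      "dtype": "<f4",
      "compressor": null,
      "fill_value": "NaN",
      "filters": null,
      "order": "C",
      "zarr_format": 2
    },
    "h_corr/.zattrs": {
      "units": "meters",
      "_ARRAY_DIMENSIONS": ["ref_pt", "cycle_number"]
    }
  }
}
"""


def summarize_zmetadata(text):
    """Return {array_name: {"shape": ..., "dtype": ..., "attrs": ...}}."""
    metadata = json.loads(text)["metadata"]
    summary = {}
    for key, value in metadata.items():
        # Keys look like "<array path>/.zarray"; root-level keys have no path.
        name, _, kind = key.rpartition("/")
        if kind == ".zarray":
            entry = summary.setdefault(name, {"attrs": {}})
            entry.update(shape=tuple(value["shape"]), dtype=value["dtype"])
        elif kind == ".zattrs" and name:
            summary.setdefault(name, {"attrs": {}})["attrs"] = value
    return summary


if __name__ == "__main__":
    for name, info in summarize_zmetadata(ZMETADATA).items():
        print(name, info["shape"], info["dtype"], info["attrs"].get("units"))
```

Nothing here touches the chunk data itself, which is the appeal: the shapes, dtypes and attributes are all discoverable from one small JSON document.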

Jupyter notebook demo can be found at https://github.com/rsignell-usgs/hurricane-ike-water-levels/blob/master/coawst_3ways.ipynb. See also discussion thread at zarr-developers/zarr-python#535 on "Using the Zarr library to read HDF5".

The main hurdles depend on upstream work, and fall into two 'separate' parts:

The current situation is that I do an HDF5 -> Zarr conversion and read from that. It would be nice to stick to the original HDF5 data source (though I might need to flatten the nested ICESat-2 ATL11 data structure). Note that I'm not necessarily after raw speed; I just prefer readability (i.e. having xarray's wonderful annotated metadata).
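As a rough sketch of what "flattening" the nested structure could mean, the snippet below turns a nested group hierarchy into flat "group/variable" keys. It uses a plain nested dict as a stand-in for the HDF5 groups so that it runs without any HDF5 library. The group and variable names (`pt1`, `ref_surf`, `dem_h`, ...) and the values are illustrative assumptions, and `flatten_groups` is a hypothetical helper.

```python
def flatten_groups(node, prefix=""):
    """Flatten a nested mapping of groups into {"group/variable": data}."""
    flat = {}
    for name, child in node.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(child, dict):
            # A dict stands in for an HDF5 group: recurse into it.
            flat.update(flatten_groups(child, prefix=path))
        else:
            # Anything else stands in for a dataset (leaf variable).
            flat[path] = child
    return flat


# Stand-in for a nested ATL11-like file layout (names and values are made up).
NESTED = {
    "pt1": {
        "latitude": [-80.1, -80.2],
        "h_corr": [101.5, 102.0],
        "ref_surf": {"dem_h": [100.0, 100.5]},
    },
    "pt2": {
        "latitude": [-80.3],
        "h_corr": [99.0],
    },
}

if __name__ == "__main__":
    for path in flatten_groups(NESTED):
        print(path)
```

With a real file, the same idea would apply to the group objects of whichever HDF5 reader is in use; the dict version only shows the key-naming scheme.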

Other open Issues/Pull Requests:

Blog posts:

You can tell I had way too many tabs open on my browser 😆

@weiji14
Owner Author

weiji14 commented Jul 20, 2023

Putting down some notes on a potential HDF5 -> pandas.DataFrame direct conversion (that skips the intermediate xarray format) using the code at https://github.com/MAAP-Project/gedi-subsetter (thanks @chuckwondo for the pointer!).

Just some things to play with once I get some free time 🙂

@chuckwondo

Awesome! Regarding the subset_hdf5 function, that's specific to the structure of GEDI data files (in particular, in relation to the BEAM* top-level groups), so you wouldn't want to use it for non-GEDI data files. For non-GEDI data files, you can directly use H5DataFrame.

@weiji14
Owner Author

weiji14 commented Aug 10, 2023

H5DataFrame works for ICESat-2 ATL03 - ICESAT-2HackWeek/h5cloud#5 🎉 There are some small quirks (e.g. the need to access a group's variables via df["group/variable"] to get at the data), but it should work for ATL11 too 🤞

We're actually working on some benchmarks over in that repo (e.g. ICESAT-2HackWeek/h5cloud#9), and the H5DataFrame read method is looking to be ~4x faster than xarray's h5netcdf (and that's without considering the conversion from xarray.Dataset -> pd.DataFrame), so looking real promising!
