
Append to icechunk stores #272

Open · wants to merge 78 commits into base: main
Conversation

@abarciauskas-bgse (Collaborator) commented Oct 25, 2024

This resizes the arrays which are being appended to and, probably too naïvely, increments the append_dim index of the chunk key by an offset of the existing number of chunks along the append dimension.

Also Zarr append ref: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/array.py#L1134-L1186
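The chunk-key offset idea described above can be sketched in isolation. This is a hypothetical helper (not the PR's actual code), assuming zarr v3-style chunk keys of the form `c/0/0/0`:

```python
def shift_chunk_key(key: str, append_axis: int, offset: int) -> str:
    """Shift the index along the append axis of a v3-style chunk key.

    e.g. appending along axis 0 to an array that already holds `offset`
    chunks there means new chunk keys start at index `offset`.
    """
    prefix, *indices = key.split("/")
    shifted = [int(i) for i in indices]
    shifted[append_axis] += offset
    return "/".join([prefix] + [str(i) for i in shifted])

shift_chunk_key("c/0/0/0", append_axis=0, offset=2)  # -> "c/2/0/0"
```

The real logic in `virtualizarr/writers/icechunk.py` works on manifest entries rather than raw key strings, but the arithmetic is the same.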

  • Closes Support appending to icechunk store #311
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

@@ -124,15 +134,37 @@ def write_virtual_variable_to_icechunk(
group: "Group",
name: str,
var: Variable,
append_dim: Optional[str] = None,
A Collaborator commented:

Probably a down-the-road concern, but maybe we should add a validation check that the append dim exists within the store.
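Such a check could look something like this (an illustrative sketch; the helper name and signature are assumptions, not the PR's API):

```python
def append_axis_for(dimension_names: tuple, append_dim: str) -> int:
    """Validate that append_dim is a dimension of the existing array, and return its axis."""
    if append_dim not in dimension_names:
        raise ValueError(
            f"append_dim {append_dim!r} not found in existing dimensions {dimension_names}"
        )
    return dimension_names.index(append_dim)

append_axis_for(("time", "lat", "lon"), "time")  # -> 0
```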

@TomNicholas TomNicholas added enhancement New feature or request Icechunk 🧊 Relates to Icechunk library / spec labels Oct 25, 2024
@TomNicholas (Member) left a comment:

All this does at the moment is resize the arrays which are being appended to and, probably too naïvely, increments the append_dim index of the chunk key by an offset of the existing number of chunks along the append dimension.

I think that's great! Does xarray have any similar logic in it?

Also this is not fully working yet, it is getting a decompression error 😭

This feature should be orthogonal to all of that, so to begin with I would concentrate on writing tests with very simple arrays, even uncompressed ones.

mode = store.mode.str

# Aimee: resize the array if it already exists
# TODO: assert chunking and encoding is the same
A Member commented:
Should also test that it raises a clear error if you try to append with chunks of a different dtype etc. I would hope zarr-python would throw that for us.
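One way to surface that error early is an explicit compatibility check before appending. This is a sketch under assumed names (in practice zarr-python may raise on its own):

```python
def check_append_compatible(existing_dtype: str, new_dtype: str,
                            existing_chunks: tuple, new_chunks: tuple) -> None:
    """Raise a clear error if the appended variable's dtype or chunk shape
    differs from the array already in the store."""
    if existing_dtype != new_dtype:
        raise ValueError(
            f"dtype mismatch: existing {existing_dtype!r} vs appended {new_dtype!r}"
        )
    if tuple(existing_chunks) != tuple(new_chunks):
        raise ValueError(
            f"chunk shape mismatch: existing {existing_chunks} vs appended {new_chunks}"
        )

check_append_compatible("float32", "float32", (10, 20), (10, 20))  # OK, returns None
```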

Comment on lines 156 to 158
existing_num_chunks = int(
    existing_size / existing_array.chunks[append_axis]
)
A Member commented:

There's a whole beartrap here around noticing whether the last chunk is smaller than the other chunks. We should raise in that case (because zarr can't support it without variable-length chunks).
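The guard could look something like this (an illustrative sketch: a partial final chunk along the append axis would make the chunk-key offset arithmetic invalid):

```python
def full_chunks_along(existing_size: int, chunk_len: int) -> int:
    """Number of existing chunks along the append axis.

    Raises if the last chunk is partial, since zarr has no
    variable-length chunks and appending past it would corrupt data.
    """
    if existing_size % chunk_len != 0:
        raise ValueError(
            f"cannot append: existing length {existing_size} is not a "
            f"multiple of the chunk length {chunk_len}"
        )
    return existing_size // chunk_len

full_chunks_along(40, 10)  # -> 4 full chunks, safe to append
```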

@abarciauskas-bgse abarciauskas-bgse self-assigned this Oct 25, 2024
@abarciauskas-bgse (Collaborator, Author) replied:

I think that's great! Does xarray have any similar logic in it?

In the case of appending to a zarr store using xarray,

  • From what I can tell, resizing happens here. (By the way, if someone can explain write_region to me I would appreciate it; I couldn't find good documentation anywhere.)
  • For writing actual chunks of data to keys, I believe that currently happens here (in zarr.array._set_selection). I will continue to dig into how it works in xarray and zarr to understand how it should work here.

Also this is not fully working yet, it is getting a decompression error 😭

This feature should be orthogonal to all of that, so to begin with I would concentrate on writing tests with very simple arrays, even uncompressed ones.

Yes I think my next step will be to write some simple tests.

codecov bot commented Oct 26, 2024

Codecov Report

Attention: Patch coverage is 36.22449% with 250 lines in your changes missing coverage. Please review.

Project coverage is 79.74%. Comparing base (09e4752) to head (532ff38).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
virtualizarr/tests/test_writers/test_icechunk.py 1.35% 146 Missing ⚠️
virtualizarr/writers/icechunk.py 0.00% 56 Missing ⚠️
virtualizarr/tests/test_codecs.py 69.49% 18 Missing ⚠️
virtualizarr/codecs.py 64.28% 15 Missing ⚠️
virtualizarr/manifests/utils.py 74.54% 14 Missing ⚠️
virtualizarr/accessor.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #272      +/-   ##
==========================================
+ Coverage   74.96%   79.74%   +4.78%     
==========================================
  Files          41       51      +10     
  Lines        2552     3669    +1117     
==========================================
+ Hits         1913     2926    +1013     
- Misses        639      743     +104     
Flag Coverage Δ
unittests 79.74% <36.22%> (+4.78%) ⬆️


@TomNicholas (Member) left a comment:

Thanks @abarciauskas-bgse ! I have a lot of smaller comments, but generally I think this is looking really promising!

Comment on lines 126 to 128
icechunk_filestore.commit(
    "test commit"
)  # need to commit it in order to append to it in the next lines
A Member commented:
I'm confused why that would be the case. What goes wrong if you write without committing, then append?

A Member replied:
Is it to do with the mode?

@abarciauskas-bgse (Collaborator, Author) replied:

We need to open the existing store in append mode in order to append; otherwise I get the error:

zarr.errors.ContainsGroupError: A group exists in store <icechunk.IcechunkStore object at 0x10eaf9100> at path ''.

That's the error I get just trying to reuse the store object from IcechunkStore.create(...). But if I do use a store with mode='a' and do not commit to the first store object, I get the following error:

FileNotFoundError: <icechunk.IcechunkStore object at 0x10960d490>

Comment on lines 199 to 211
# determine number of existing chunks along the append axis
existing_num_chunks = num_chunks(
    array=group[name],
    axis=append_axis,
)

# creates array if it doesn't already exist
arr = group.require_array(
    name=name,
    shape=zarray.shape,
    chunk_shape=zarray.chunks,
    dtype=encode_dtype(zarray.dtype),
    codecs=zarray._v3_codec_pipeline(),
    dimension_names=var.dims,
    fill_value=zarray.fill_value,
    # TODO fill_value?
)

# TODO it would be nice if we could assign directly to the .attrs property
for k, v in var.attrs.items():
    arr.attrs[k] = encode_zarr_attr_value(v)
arr.attrs["_ARRAY_DIMENSIONS"] = encode_zarr_attr_value(var.dims)

# resize the array
arr = resize_array(
    group=group,
    name=name,
    var=var,
    append_axis=append_axis,
)
A Member commented:
Here you determine existing_num_chunks, but then don't actually use it until inside write_manifest_virtual_refs. I think you could move the num_chunks call inside write_manifest_virtual_refs, and eliminate the need to pass the existing_num_chunks arg down.

@abarciauskas-bgse (Collaborator, Author) replied:

Ah yes, you're right, but the challenge is that after we resize the array, num_chunks will no longer return the right size. I will think about whether there is a better way to handle this, so we don't have to pass the existing_num_chunks arg around.
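One way to avoid passing the arg around is to compute both values up front, before the resize mutates the shape. An illustrative sketch (not the PR's code; names are assumptions):

```python
def plan_append(existing_size: int, chunk_len: int, n_new_chunks: int):
    """Compute the chunk-key offset and the resized length *before* resizing,
    since a chunk count read after the resize would reflect the new shape."""
    existing_num_chunks = existing_size // chunk_len  # offset for new chunk keys
    new_size = existing_size + n_new_chunks * chunk_len
    return existing_num_chunks, new_size

plan_append(40, 10, 2)  # -> (4, 60): new refs get chunk keys starting at index 4
```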

@abarciauskas-bgse (Collaborator, Author) commented Nov 25, 2024

I may move this PR back to draft or close it for now, as I need to investigate some decompression errors that show up when using the NOAA OISST CDR dataset (see https://github.com/zarr-developers/VirtualiZarr/blob/icechunk-append/noaa-cdr-sst.ipynb, or run it on mybinder at https://mybinder.org/v2/gh/zarr-developers/VirtualiZarr/icechunk-append?labpath=noaa-cdr-sst.ipynb), and I need to postpone that investigation to focus on some other tasks.

fyi @TomNicholas @mpiannucci

@TomNicholas (Member):

I don't know what the issue is, but surely this error is unrelated to the functionality added in this PR?

This PR has tests, which pass, so I would be in favour of merging it now and making the functionality public + adding documentation in a follow-up. Also some of the codecs stuff might be useful to @norlandrhagen in #271.

Also if it just sits here for a while in open status that's also fine.

@mpiannucci (Contributor):

Hmmmm, I have seen that error before: when I had the wrong order of codecs, it caused the compression to be messed up.

@abarciauskas-bgse (Collaborator, Author):

@TomNicholas thanks - I do very much want to merge it of course but I would feel a lot better about it if I have a working "real data" example first.

@mpiannucci that is a good lead. Would you just check the metadata?

The most important clue seems to be that the "0th" time chunk is being overwritten, which shouldn't happen since we are appending along the time dimension; only chunks 2 and 3 should be added. I just updated the notebook to show exactly what I mean: compare cell 11 with cell 20.

@mpiannucci (Contributor):

Oh, that's a good find! It looks like a chunk is being overwritten somehow, for sure.

If I understand the notebook, the original dataset loads fine from icechunk to xarray before the append. So at least that narrows it down.

I am happy to try and take a look (this confirms we need a way to get references out of icechunk too)

@mpiannucci (Contributor):

BTW, Icechunk alpha 5 is out now! So it may be worth syncing this PR and seeing if that helps too.

@abarciauskas-bgse (Collaborator, Author):

@mpiannucci thanks for the heads up, I upgraded but it did not resolve the error.

Also, I'm sure you realized this, but with zarr a change has to be made at https://github.com/mpiannucci/kerchunk/blob/v3/kerchunk/utils.py#L55 from mode='w' to read_only=False when upgrading to icechunk 0.1.0a5 (as it depends on zarr v3.0.0-beta.2).

@mpiannucci (Contributor) commented Nov 26, 2024

Thanks! I'll sync those changes in kerchunk land. It should be good now.

I will have time tomorrow to test this PR out for real

Labels
enhancement New feature or request Icechunk 🧊 Relates to Icechunk library / spec
4 participants