Append to icechunk stores #272
Conversation
```diff
@@ -124,15 +134,37 @@ def write_virtual_variable_to_icechunk(
     group: "Group",
     name: str,
     var: Variable,
     append_dim: Optional[str] = None,
```
Probably a down the road concern, but maybe we should add a validation / check that the append dim exists within the store.
All this does at the moment is resize the arrays which are being appended to and, probably too naïvely, increments the `append_dim` index of the chunk key by an offset of the existing number of chunks along the append dimension.
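The chunk-key offset described here can be sketched roughly as follows. This is a hypothetical helper, not the PR's actual code, and it assumes `/`-separated chunk keys:

```python
def offset_chunk_key(key: str, append_axis: int, existing_num_chunks: int) -> str:
    """Shift a chunk key's index along the append axis by the number of
    chunks already present, e.g. "0/0" becomes "2/0" when two chunks
    already exist along axis 0."""
    indices = key.split("/")
    indices[append_axis] = str(int(indices[append_axis]) + existing_num_chunks)
    return "/".join(indices)
```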
I think that's great! Does xarray have any similar logic in it?
Also this is not fully working yet, it is getting a decompression error 😭
This feature should be orthogonal to all of that, so to begin with I would concentrate on writing tests with very simple arrays, even uncompressed ones.
`virtualizarr/writers/icechunk.py` (outdated):

```python
mode = store.mode.str

# Aimee: resize the array if it already exists
# TODO: assert chunking and encoding is the same
```
Should also test that it raises a clear error if you try to append with chunks of a different dtype etc. I would hope zarr-python would throw that for us.
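A sketch of the kind of pre-flight check meant here (hypothetical helper; ideally zarr-python would enforce this for us):

```python
def check_append_compatible(
    existing_dtype: str,
    new_dtype: str,
    existing_chunks: tuple,
    new_chunks: tuple,
) -> None:
    # Fail early with a clear message instead of relying on a cryptic
    # downstream decompression or decoding error.
    if existing_dtype != new_dtype:
        raise ValueError(
            f"dtype mismatch: existing array is {existing_dtype}, "
            f"appended data is {new_dtype}"
        )
    if existing_chunks != new_chunks:
        raise ValueError(
            f"chunk shape mismatch: {existing_chunks} vs {new_chunks}"
        )
```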
`virtualizarr/writers/icechunk.py` (outdated):

```python
existing_num_chunks = int(
    existing_size / existing_array.chunks[append_axis]
)
```
There's a whole beartrap here around noticing if the last chunk is smaller than the other chunks. We should throw in that case (because zarr can't support it without variable-length chunks).
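A minimal version of the guard suggested here (hypothetical names), which rejects an append when the existing size does not divide evenly into chunks:

```python
def check_no_partial_final_chunk(existing_size: int, chunk_len: int, dim: str) -> None:
    # Zarr cannot represent a mid-array chunk that is smaller than the
    # others (no variable-length chunks), so a ragged final chunk along
    # the append dimension must be rejected rather than appended past.
    if existing_size % chunk_len != 0:
        raise ValueError(
            f"Cannot append along {dim!r}: existing size {existing_size} is "
            f"not a multiple of the chunk length {chunk_len}, so the final "
            f"chunk is partial."
        )
```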
> In the case of appending to a zarr store using xarray,

Yes, I think my next step will be to write some simple tests.
Codecov Report — Attention: Patch coverage is

```
@@            Coverage Diff             @@
##             main     #272      +/-  ##
==========================================
+ Coverage   74.96%   79.74%   +4.78%
==========================================
  Files          41       51      +10
  Lines        2552     3669    +1117
==========================================
+ Hits         1913     2926    +1013
- Misses        639      743     +104
```
Thanks @abarciauskas-bgse ! I have a lot of smaller comments, but generally I think this is looking really promising!
```python
icechunk_filestore.commit(
    "test commit"
)  # need to commit it in order to append to it in the next lines
```
I'm confused why that would be the case. What goes wrong if you write without committing, then append?
Is it to do with the `mode`?
We need to open the existing store in append mode in order to append, otherwise I get the error:

```
zarr.errors.ContainsGroupError: A group exists in store <icechunk.IcechunkStore object at 0x10eaf9100> at path ''.
```

That's the error I get just trying to reuse the store object from `IcechunkStore.create()`. But if I do use a store with `mode='a'` and do not commit to the first store object, I get the following error:

```
FileNotFoundError: <icechunk.IcechunkStore object at 0x10960d490>
```
`virtualizarr/writers/icechunk.py` (outdated):

```python
# determine number of existing chunks along the append axis
existing_num_chunks = num_chunks(
    array=group[name],
    axis=append_axis,
)

# creates array if it doesn't already exist
arr = group.require_array(
    name=name,
    shape=zarray.shape,
    chunk_shape=zarray.chunks,
    dtype=encode_dtype(zarray.dtype),
    codecs=zarray._v3_codec_pipeline(),
    dimension_names=var.dims,
    fill_value=zarray.fill_value,
    # TODO fill_value?
)

# TODO it would be nice if we could assign directly to the .attrs property
for k, v in var.attrs.items():
    arr.attrs[k] = encode_zarr_attr_value(v)
arr.attrs["_ARRAY_DIMENSIONS"] = encode_zarr_attr_value(var.dims)

# resize the array
arr = resize_array(
    group=group,
    name=name,
    var=var,
    append_axis=append_axis,
)
```
Here you determine `existing_num_chunks`, but then don't actually use it until inside `write_manifest_virtual_refs`. I think you could move the `num_chunks` call inside `write_manifest_virtual_refs`, and eliminate the need to pass the `existing_num_chunks` arg down.
Ah yes, you're right, but the challenge is that after we resize the array, the `num_chunks` function will no longer return the right count. I will think about whether there is a better way to handle this, so we don't have to pass the `existing_num_chunks` arg around.
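A toy illustration (plain integers, no zarr) of the ordering problem being discussed: the chunk count has to be captured before the resize, because afterwards the same computation reflects the enlarged array:

```python
chunk_len = 5
old_size = 10

# Captured BEFORE resizing: this is the correct offset for new chunk keys.
existing_num_chunks = old_size // chunk_len  # 2

# Simulate resize_array growing the array for the append.
new_size = old_size + 15

# The same computation AFTER the resize gives a different (wrong) offset.
post_resize_count = new_size // chunk_len  # 5
```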
I may move this PR back to draft or close it for now, as I need to investigate some decompression errors showing up when using the NOAA OISST CDR dataset (see https://github.com/zarr-developers/VirtualiZarr/blob/icechunk-append/noaa-cdr-sst.ipynb, or run it in mybinder at https://mybinder.org/v2/gh/zarr-developers/VirtualiZarr/icechunk-append?labpath=noaa-cdr-sst.ipynb) and need to postpone that investigation to focus on some other tasks.
I don't know what the issue is, but surely this is an error unrelated to the functionality added in this PR? This PR has tests, which pass, so I would be in favour of merging it now and making the functionality public + adding documentation in a follow-up. Also some of the codecs stuff might be useful to @norlandrhagen in #271. Also, if it just sits here for a while in open status, that's fine too.
Hmmmm, I have seen that error before: when I had the wrong order of codecs, it caused the compression to be messed up.
@TomNicholas thanks - I do very much want to merge it, of course, but I would feel a lot better about it if I had a working "real data" example first. @mpiannucci that is a good lead. Would you just check the metadata? The most important "clue" seems to be that the "0th" time chunk is being overwritten - which it shouldn't be, since we are appending along the time dimension: only chunks 2 and 3 should be added. I just updated the notebook to show exactly what I mean: compare cell 11 with cell 20.
Oh that's a good find! It looks like a chunk is being overwritten somehow, for sure. If I understand the notebook, the original dataset loads fine from icechunk to xarray before the append, so at least that narrows it down. I am happy to try and take a look (this confirms we need a way to get references out of icechunk too).
BTW Icechunk alpha 5 is out now! So it may be worth syncing this PR and see if that helps too |
@mpiannucci thanks for the heads up - I upgraded, but it did not resolve the error. Also, I'm sure you realized this, but I think with zarr a change has to be made at https://github.com/mpiannucci/kerchunk/blob/v3/kerchunk/utils.py#L55 from
Thanks! I'll sync those changes in kerchunk land. It should be good now. I will have time tomorrow to test this PR out for real.
This resizes the arrays which are being appended to and, probably too naïvely, increments the `append_dim` index of the chunk key by an offset of the existing number of chunks along the append dimension.
Also Zarr append ref: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/array.py#L1134-L1186