
Append to icechunk stores #272

Open · wants to merge 78 commits into base: main
Conversation

@abarciauskas-bgse (Collaborator) commented Oct 25, 2024

This resizes the arrays which are being appended to and, probably too naïvely, increments the append_dim index of the chunk key by an offset of the existing number of chunks along the append dimension.

Also Zarr append ref: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/array.py#L1134-L1186
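The chunk-key offset idea described above can be sketched in isolation. This is a hypothetical helper (not the PR's actual code), assuming zarr v3-style chunk keys of the form `c/0/0/0`:

```python
def shift_chunk_key(key: str, append_axis: int, offset: int) -> str:
    """Shift the index along the append axis of a v3-style chunk key.

    e.g. appending along axis 0 to an array that already holds `offset`
    chunks there means new chunk keys start at index `offset`.
    """
    prefix, *indices = key.split("/")
    shifted = [int(i) for i in indices]
    shifted[append_axis] += offset
    return "/".join([prefix] + [str(i) for i in shifted])

shift_chunk_key("c/0/0/0", append_axis=0, offset=2)  # -> "c/2/0/0"
```

The real logic in `virtualizarr/writers/icechunk.py` works on manifest entries rather than raw key strings, but the arithmetic is the same.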

  • Closes Support appending to icechunk store #311
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

@@ -124,15 +134,37 @@ def write_virtual_variable_to_icechunk(
group: "Group",
name: str,
var: Variable,
append_dim: Optional[str] = None,
A Collaborator commented:

Probably a down-the-road concern, but maybe we should add a validation check that the append dim exists within the store.
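Such a check could look something like this (an illustrative sketch; the helper name and signature are assumptions, not the PR's API):

```python
def append_axis_for(dimension_names: tuple, append_dim: str) -> int:
    """Validate that append_dim is a dimension of the existing array, and return its axis."""
    if append_dim not in dimension_names:
        raise ValueError(
            f"append_dim {append_dim!r} not found in existing dimensions {dimension_names}"
        )
    return dimension_names.index(append_dim)

append_axis_for(("time", "lat", "lon"), "time")  # -> 0
```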

@TomNicholas TomNicholas added enhancement New feature or request Icechunk 🧊 Relates to Icechunk library / spec labels Oct 25, 2024
@TomNicholas (Member) left a comment:

All this does at the moment is resize the arrays which are being appended to and, probably too naïvely, increments the append_dim index of the chunk key by an offset of the existing number of chunks along the append dimension.

I think that's great! Does xarray have any similar logic in it?

Also this is not fully working yet, it is getting a decompression error 😭

This feature should be orthogonal to all of that, so to begin with I would concentrate on writing tests with very simple arrays, even uncompressed ones.

mode = store.mode.str

# Aimee: resize the array if it already exists
# TODO: assert chunking and encoding is the same
A Member commented:
Should also test that it raises a clear error if you try to append with chunks of a different dtype etc. I would hope zarr-python would throw that for us.
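One way to surface that error early is an explicit compatibility check before appending. This is a sketch under assumed names (in practice zarr-python may raise on its own):

```python
def check_append_compatible(existing_dtype: str, new_dtype: str,
                            existing_chunks: tuple, new_chunks: tuple) -> None:
    """Raise a clear error if the appended variable's dtype or chunk shape
    differs from the array already in the store."""
    if existing_dtype != new_dtype:
        raise ValueError(
            f"dtype mismatch: existing {existing_dtype!r} vs appended {new_dtype!r}"
        )
    if tuple(existing_chunks) != tuple(new_chunks):
        raise ValueError(
            f"chunk shape mismatch: existing {existing_chunks} vs appended {new_chunks}"
        )

check_append_compatible("float32", "float32", (10, 20), (10, 20))  # OK, returns None
```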

Comment on lines 156 to 158
existing_num_chunks = int(
    existing_size / existing_array.chunks[append_axis]
)
A Member commented:

There's a whole beartrap here around noticing whether the last chunk is smaller than the other chunks. We should raise in that case (because zarr can't support it without variable-length chunks).
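The guard could look something like this (an illustrative sketch: a partial final chunk along the append axis would make the chunk-key offset arithmetic invalid):

```python
def full_chunks_along(existing_size: int, chunk_len: int) -> int:
    """Number of existing chunks along the append axis.

    Raises if the last chunk is partial, since zarr has no
    variable-length chunks and appending past it would corrupt data.
    """
    if existing_size % chunk_len != 0:
        raise ValueError(
            f"cannot append: existing length {existing_size} is not a "
            f"multiple of the chunk length {chunk_len}"
        )
    return existing_size // chunk_len

full_chunks_along(40, 10)  # -> 4 full chunks, safe to append
```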

@abarciauskas-bgse abarciauskas-bgse self-assigned this Oct 25, 2024
@abarciauskas-bgse (Collaborator, Author) replied:

I think that's great! Does xarray have any similar logic in it?

In the case of appending to a zarr store using xarray,

  • From what I can tell, resizing happens here. (By the way, if someone can explain write_region to me I would appreciate it; I couldn't find good documentation anywhere.)
  • For writing actual chunks of data to keys, I believe that currently happens here (in zarr.array._set_selection). I will continue to dig into how it works in xarray and zarr to understand how it should work here.

Also this is not fully working yet, it is getting a decompression error 😭

This feature should be orthogonal to all of that, so to begin with I would concentrate on writing tests with very simple arrays, even uncompressed ones.

Yes I think my next step will be to write some simple tests.

codecov bot commented Oct 26, 2024

Codecov Report

Attention: Patch coverage is 36.22449% with 250 lines in your changes missing coverage. Please review.

Project coverage is 79.74%. Comparing base (09e4752) to head (532ff38).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
virtualizarr/tests/test_writers/test_icechunk.py 1.35% 146 Missing ⚠️
virtualizarr/writers/icechunk.py 0.00% 56 Missing ⚠️
virtualizarr/tests/test_codecs.py 69.49% 18 Missing ⚠️
virtualizarr/codecs.py 64.28% 15 Missing ⚠️
virtualizarr/manifests/utils.py 74.54% 14 Missing ⚠️
virtualizarr/accessor.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #272      +/-   ##
==========================================
+ Coverage   74.96%   79.74%   +4.78%     
==========================================
  Files          41       51      +10     
  Lines        2552     3669    +1117     
==========================================
+ Hits         1913     2926    +1013     
- Misses        639      743     +104     
Flag Coverage Δ
unittests 79.74% <36.22%> (+4.78%) ⬆️


@TomNicholas (Member) left a comment:

Thanks @abarciauskas-bgse ! I have a lot of smaller comments, but generally I think this is looking really promising!

Comment on lines 126 to 128
icechunk_filestore.commit(
    "test commit"
)  # need to commit it in order to append to it in the next lines
A Member commented:
I'm confused why that would be the case. What goes wrong if you write without committing, then append?

A Member replied:
Is it to do with the mode?

@abarciauskas-bgse (Collaborator, Author) replied:

We need to open the existing store in append mode in order to append; otherwise I get the error:

zarr.errors.ContainsGroupError: A group exists in store <icechunk.IcechunkStore object at 0x10eaf9100> at path ''.

That's the error I get just trying to reuse the store object from IcechunkStore.create(...). But if I do use a store with mode='a' and do not commit to the first store object, I get the following error:

FileNotFoundError: <icechunk.IcechunkStore object at 0x10960d490>

Comment on lines 199 to 211
# determine number of existing chunks along the append axis
existing_num_chunks = num_chunks(
    array=group[name],
    axis=append_axis,
)

# creates array if it doesn't already exist
arr = group.require_array(
    name=name,
    shape=zarray.shape,
    chunk_shape=zarray.chunks,
    dtype=encode_dtype(zarray.dtype),
    codecs=zarray._v3_codec_pipeline(),
    dimension_names=var.dims,
    fill_value=zarray.fill_value,
    # TODO fill_value?
)

# TODO it would be nice if we could assign directly to the .attrs property
for k, v in var.attrs.items():
    arr.attrs[k] = encode_zarr_attr_value(v)
arr.attrs["_ARRAY_DIMENSIONS"] = encode_zarr_attr_value(var.dims)

# resize the array
arr = resize_array(
    group=group,
    name=name,
    var=var,
    append_axis=append_axis,
)
A Member commented:
Here you determine existing_num_chunks, but then don't actually use it until inside write_manifest_virtual_refs. I think you could move the num_chunks call inside write_manifest_virtual_refs, and eliminate the need to pass the existing_num_chunks arg down.

@abarciauskas-bgse (Collaborator, Author) replied:

Ah yes, you're right, but the challenge is that after we resize the array, num_chunks will no longer return the right size. I will think about whether there is a better way to handle this, so we don't have to pass the existing_num_chunks arg around.
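One way to avoid passing the arg around is to compute both values up front, before the resize mutates the shape. An illustrative sketch (not the PR's code; names are assumptions):

```python
def plan_append(existing_size: int, chunk_len: int, n_new_chunks: int):
    """Compute the chunk-key offset and the resized length *before* resizing,
    since a chunk count read after the resize would reflect the new shape."""
    existing_num_chunks = existing_size // chunk_len  # offset for new chunk keys
    new_size = existing_size + n_new_chunks * chunk_len
    return existing_num_chunks, new_size

plan_append(40, 10, 2)  # -> (4, 60): new refs get chunk keys starting at index 4
```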

@abarciauskas-bgse (Collaborator, Author) commented Nov 25, 2024

I may move this PR back to draft or close it for now, as I need to investigate some decompression errors that show up when using the NOAA OISST CDR dataset (see https://github.com/zarr-developers/VirtualiZarr/blob/icechunk-append/noaa-cdr-sst.ipynb, or run it on mybinder at https://mybinder.org/v2/gh/zarr-developers/VirtualiZarr/icechunk-append?labpath=noaa-cdr-sst.ipynb), and I need to postpone that investigation to focus on some other tasks.

fyi @TomNicholas @mpiannucci

@TomNicholas (Member):

I don't know what the issue is, but surely this error is unrelated to the functionality added in this PR?

This PR has tests, which pass, so I would be in favour of merging it now and making the functionality public + adding documentation in a follow-up. Also some of the codecs stuff might be useful to @norlandrhagen in #271.

Also if it just sits here for a while in open status that's also fine.

@mpiannucci (Contributor):

Hmmmm, I have seen that error before: when I had the wrong order of codecs, it caused the compression to be messed up.

@abarciauskas-bgse (Collaborator, Author):

@TomNicholas thanks - I do very much want to merge it of course but I would feel a lot better about it if I have a working "real data" example first.

@mpiannucci that is a good lead. Would you just check the metadata?

The most important clue seems to be that the "0th" time chunk is being overwritten, which shouldn't happen since we are appending along the time dimension; only chunks 2 and 3 should be added. I just updated the notebook to show exactly what I mean: compare cell 11 with cell 20.

@mpiannucci (Contributor):

Oh, that's a good find! It looks like a chunk is being overwritten somehow, for sure.

If I understand the notebook, the original dataset loads fine from icechunk to xarray before the append. So at least that narrows it down.

I am happy to try and take a look (this confirms we need a way to get references out of icechunk too)

@mpiannucci (Contributor):

BTW, Icechunk alpha 5 is out now! So it may be worth syncing this PR and seeing if that helps too.

@abarciauskas-bgse (Collaborator, Author):

@mpiannucci thanks for the heads up, I upgraded but it did not resolve the error.

Also, I'm sure you realized this, but with zarr a change has to be made at https://github.com/mpiannucci/kerchunk/blob/v3/kerchunk/utils.py#L55 from mode='w' to read_only=False when upgrading to icechunk 0.1.0a5 (as it depends on zarr v3.0.0-beta.2).

@mpiannucci (Contributor) commented Nov 26, 2024

Thanks! I'll sync those changes in kerchunk land. It should be good now.

I will have time tomorrow to test this PR out for real

Labels
enhancement New feature or request Icechunk 🧊 Relates to Icechunk library / spec
4 participants