Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(fix): structured arrays for v2 #2681

Merged
merged 18 commits into from
Jan 21, 2025

Conversation

ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented Jan 9, 2025

This is a best guess based on https://numpy.org/doc/2.1/reference/generated/numpy.dtype.kind.html and the fact that VLenBytes appears to be explicitly for strings.

This addresses the v2 case of #2134

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@ilan-gold ilan-gold marked this pull request as ready for review January 10, 2025 15:13
@ilan-gold
Copy link
Contributor Author

Hmm, this will need to handle the case where the array is not given the dtype.

@ilan-gold
Copy link
Contributor Author

ilan-gold commented Jan 10, 2025

(P.S I used zarr.create(path=...) which didn't error but actually does nothing and uses a MemoryStore, which seems like a common error i.e., using path instead of store)

@dstansby dstansby added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 10, 2025
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Jan 13, 2025
Comment on lines +197 to +202
# In the case of zarr v2, the simplest i.e., '|VXX' dtype is represented as a string
dtype_descr = self.dtype.descr
if self.dtype.kind == "V" and dtype_descr[0][0] != "" and len(dtype_descr) != 0:
dtype_json = tuple(self.dtype.descr)
else:
dtype_json = self.dtype.str
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my attempt to match the old behavior. I didn't look back at the old code yet, but if someone knows this to be wrong, would be great to know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks right to me

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am keen to see this go in

src/zarr/core/buffer/core.py Show resolved Hide resolved
@@ -220,6 +227,8 @@ def update_attributes(self, attributes: dict[str, JSON]) -> Self:


def parse_dtype(data: npt.DTypeLike) -> np.dtype[Any]:
if isinstance(data, list): # this is a valid _VoidDTypeLike check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any iterable?

Copy link
Contributor Author

@ilan-gold ilan-gold Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to handle the [(field_name, field_dtype, field_shape), ...] case on https://numpy.org/doc/2.1/reference/arrays.dtypes.html#specifying-and-constructing-data-types but at the same time to obey
Screenshot 2025-01-14 at 19 22 14

This might require more stringent checking or tests...Not sure. The reason this tuple conversion happens is that lists (as data types) incoming from on-disk reads contain lists, not tuples. So maybe we should check list and data[0] is also list? And throw an error if it isn't? I'm not sure what else could be in the lists though

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the dtype constructor would make an exception (or our own comprehension fails) in the case the JSON on disk was edited - so I'm not too worried.

@@ -220,6 +227,8 @@ def update_attributes(self, attributes: dict[str, JSON]) -> Self:


def parse_dtype(data: npt.DTypeLike) -> np.dtype[Any]:
if isinstance(data, list): # this is a valid _VoidDTypeLike check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the dtype constructor would make an exception (or our own comprehension fails) in the case the JSON on disk was edited - so I'm not too worried.

@martindurant
Copy link
Member

ping @d-v-b - I'd be happy to merge if you have no objections.

@d-v-b
Copy link
Contributor

d-v-b commented Jan 21, 2025

I think this looks good, but I'm admittedly not a structured dtype user, so I can't give it a very close examination. I think the only thing remaining is to use the new changelog system added by #2736

@martindurant
Copy link
Member

I moved the changelog line to where it should be and will merge this when green, and leave a note on the towncrier PR thread saying that this will need dealing with (that PR is not yer merged).

@d-v-b
Copy link
Contributor

d-v-b commented Jan 21, 2025

great, thanks for working on this @ilan-gold and @martindurant

@martindurant martindurant merged commit a260ae9 into zarr-developers:main Jan 21, 2025
30 checks passed
@ilan-gold ilan-gold deleted the ig/structured_arrays_v2 branch January 22, 2025 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants