Prototype of new DType interface #2750
base: main
@@ -0,0 +1,7 @@
from zarr.core.dtype.core import (
    ZarrDType
)

__all__ = [
    "ZarrDType"
]
@@ -0,0 +1,204 @@
""" | ||||||
# Overview | ||||||
|
||||||
This module provides a proof-of-concept standalone interface for managing dtypes in the zarr-python codebase. | ||||||
|
||||||
The `ZarrDType` class introduced in this module effectively acts as a replacement for `np.dtype` throughout the | ||||||
zarr-python codebase. It attempts to encapsulate all relevant runtime information necessary for working with | ||||||
dtypes in the context of the Zarr V3 specification (e.g. is this a core dtype or not, how many bytes and what | ||||||
endianness is the dtype etc). By providing this abstraction, the module aims to: | ||||||
|
||||||
- Simplify dtype management within zarr-python | ||||||
- Support runtime flexibility and custom extensions | ||||||
- Remove unnecessary dependencies on the numpy API | ||||||

## Extensibility

The module attempts to support user-driven extensions, allowing developers to introduce custom dtypes
without requiring immediate changes to zarr-python. Extensions can leverage the current entrypoint mechanism,
enabling integration of experimental features. Over time, widely adopted extensions may be formalized through
inclusion in zarr-python or standardized via a Zarr Enhancement Proposal (ZEP), but this is not essential.

## Examples

### Core `dtype` Registration

The following example demonstrates how to register a built-in `dtype` in the core codebase:

```python
import numpy as np

from zarr.core.dtype import ZarrDType
from zarr.registry import register_v3dtype

class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype('float16')

register_v3dtype(Float16)
```

### Entrypoint Extension

The following example demonstrates how users can register a new `bfloat16` dtype for Zarr.
This approach adheres to the existing Zarr entrypoint pattern as much as possible, ensuring
consistency with other extensions. The code below would typically be part of a Python package
that specifies the entrypoints for the extension:

```python
import ml_dtypes
import numpy as np

from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

class Bfloat16(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype('bfloat16')  # Enabled by importing ml_dtypes
    configuration_v3 = {
        "version": "example_value",
        "author": "example_value",
        "ml_dtypes_version": "example_value"
    }
```

### dtype lookup

The following examples demonstrate how to look up the relevant `ZarrDType`, given
a string that matches the dtype's Zarr specification ID, or a numpy dtype object:

```python
import numpy as np

from zarr.registry import get_v3dtype_class, get_v3dtype_class_from_numpy

get_v3dtype_class('complex64')  # returns little-endian Complex64 ZarrDType
get_v3dtype_class('not_registered_dtype')  # ValueError

get_v3dtype_class_from_numpy('>i2')  # returns big-endian Int16 ZarrDType
get_v3dtype_class_from_numpy(np.dtype('float32'))  # returns little-endian Float32 ZarrDType
get_v3dtype_class_from_numpy('i10')  # ValueError
```

### String dtypes

The following indicates one possibility for supporting variable-length strings, via the
entrypoint mechanism as in the previous example. The Apache Arrow specification does not currently
include a dtype for fixed-length strings (only for fixed-length bytes), and so I am using string
here to implicitly refer to variable-length string data (there may be some subtleties with codecs
that mean this needs to be refined further):

```python
import numpy as np

from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

try:
    to_numpy = np.dtypes.StringDType()
except AttributeError:
    to_numpy = np.dtypes.ObjectDType()

class String(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = 'little'
    byte_count = None  # None is defined to mean variable
    to_numpy = to_numpy
```
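
A minimal, illustrative sketch of how the fallback above behaves; it assumes numpy >= 2.0 provides
`np.dtypes.StringDType`, with the object dtype used on older versions:

```python
import numpy as np

# `String.to_numpy` resolves to np.dtypes.StringDType() on numpy >= 2.0,
# and to np.dtypes.ObjectDType() (i.e. dtype('O')) on older numpy versions.
arr = np.array(["short", "a much longer string"], dtype=String.to_numpy)
print(arr.dtype)  # StringDType() on numpy >= 2.0, object otherwise
```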

### int4 dtype

There is currently considerable interest in the AI community in 'quantising' models - storing
models at reduced precision, while minimising loss of information content. There are a number
of sub-byte dtypes that the community is using, e.g. int4. Unfortunately numpy does not
currently have support for handling such sub-byte dtypes in an easy way. However, they can
still be held in a numpy array and then passed (in a zero-copy way) to something like pytorch,
which can handle them appropriately:

```python
import numpy as np

from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

class Int4(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = 'little'
    byte_count = 1  # this is ugly, but I could change this from byte_count to bit_count if there was consensus
    to_numpy = np.dtype('B')  # could also be np.dtype('V1'), but this would prevent bit-twiddling
    configuration_v3 = {
        "version": "example_value",
        "author": "example_value",
    }
```
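
A rough sketch of the zero-copy hand-off described above. This assumes PyTorch is installed, and the
two-int4-values-per-byte packing shown here is just one possible convention, not part of this proposal:

```python
import numpy as np
import torch

# Two int4 values packed into each uint8 byte; np.dtype('B') holds the raw bytes.
packed = np.array([0x12, 0x34, 0x56], dtype='B')

# torch.from_numpy shares memory with the numpy array - no copy is made.
raw = torch.from_numpy(packed)

# Bit-twiddling to unpack the high and low nibbles (shown in numpy for illustration).
high = packed >> 4
low = packed & 0x0F
```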
""" | ||||||

from __future__ import annotations

from typing import Any, Literal

import numpy as np


# perhaps over-complicating, but I don't want to allow the attributes to be patched
Implementation detail: I decided to try to freeze the class attributes after a dtype class has been created; I used a metaclass for this. It's not essential. (A short sketch of the resulting behaviour follows the class below.)
class FrozenClassVariables(type):
    def __setattr__(cls, attr, value):
        if hasattr(cls, attr):
            raise ValueError(
                f"Attribute {attr} on ZarrDType class can not be changed once set."
            )
        else:
            # allow the attribute to be set the first time (e.g. during __init_subclass__)
            super().__setattr__(attr, value)
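
To illustrate the freezing behaviour, here is a minimal sketch; it reuses the `Float16` registration
example from the module docstring above:

```python
import numpy as np
from zarr.core.dtype import ZarrDType

class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype('float16')

# Once the class exists, its attributes cannot be patched:
Float16.byte_count = 4  # raises ValueError via FrozenClassVariables.__setattr__
```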
class ZarrDType(metaclass=FrozenClassVariables):
The most important thing here IMO is that I would like to replace statements that work directly with `np.dtype` throughout the zarr-python codebase with `ZarrDType`. I have included the attributes that I currently believe are necessary, but some may be unnecessary, and I may have forgotten others. It's a first attempt!
    zarr_spec_format: Literal["2", "3"]  # the version of the zarr spec used
    experimental: bool  # is this in the core spec or not
    endianness: Literal[
Zarr V3 has made the decision to use a codec for endianness; endianness is not to be attached to the dtype. Currently, I think the practical solution is for endianness to remain on `ZarrDType` purely as an implementation detail. I can elaborate on this with examples if helpful, but basically, endianness would just be an implementation detail for the in-memory (`to_numpy`) representation, not something serialised as part of the dtype metadata.
"big", "little", None | ||||||
] # None indicates not defined i.e. single byte or byte strings | ||||||
byte_count: int | None # None indicates variable count | ||||||
to_numpy: np.dtype[ | ||||||
        Any
    ]  # may involve installing a numpy extension e.g. ml_dtypes

    configuration_v3: (
Wasn't clear to me how this is intended to be used in the spec...

Basically, dtypes can be represented in the JSON metadata as a short-hand (just the dtype name), or in a longer form that pairs the name with a configuration object.

Ah, interesting, thanks! My current thinking is that every dtype that is not in the core spec should include a configuration.

Assigning names for extensions, such as dtypes, is something that the Zarr core spec should define to coordinate between the different Zarr implementations. However, there are some gaps in the spec when it comes to providing clear guidance for doing that. In the Zarr steering council, we are currently evaluating different options that we will propose to the community shortly. Our goal is to achieve a naming mechanism that avoids naming conflicts; our current favorite is to have two types of names.

Not necessarily. I think this information would be better placed in the specification documents of the dtypes.
        dict | None
    )  # TODO: understand better how this is recommended by the spec
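
To make the short-hand vs. configuration distinction discussed above concrete, here is a rough sketch of the two
metadata forms as Python dicts. The exact layout for extension dtypes is still being discussed, so treat the
`name`/`configuration` structure as an assumption; the configuration keys are taken from the `Bfloat16` example
in the module docstring:

```python
# Short-hand form: the dtype is identified only by its name.
core_metadata = {"data_type": "float16"}

# Longer form (assumed layout): the name is paired with a configuration object.
extension_metadata = {
    "data_type": {
        "name": "bfloat16",
        "configuration": {
            "version": "example_value",
            "author": "example_value",
            "ml_dtypes_version": "example_value",
        },
    }
}
```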
    _zarr_spec_identifier: str  # implementation detail used to map to core spec
I guess this would be mandatory? Or, how would the identifier for the metadata be specified otherwise?

Example:

```python
class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "big"
    byte_count = 2
    to_numpy = np.dtype('float16')
```

This would generate the identifier `big_float16` (the class name, lower-cased, prefixed with `big_` because of the big endianness). It's probably a bit clumsy in its current implementation. I ended up with this pattern i) as a way of tracking the in-memory endianness of the dtype and ii) to make sure the class name stays consistent with the identifier - the class name pops up a few times in the entrypoint registration logic, and I didn't want a situation where a class name could have a different value to the spec identifier. Obviously, the convention is that a class needs to be named according to how the dtype is identified in the spec.

As I wrote in my other comment, we generally want to follow some guidelines around how and what metadata goes into the configuration. In any case, maybe I missed it, why do we need to persist the endianness as part of the dtype? Shouldn't that be handled by the appropriate array-to-bytes codec (e.g. the `bytes` codec)?

The confusion is my fault - I should probably have made it more clear what metadata I was proposing to serialise to disk. I am proposing to serialise metadata to disk exactly as the current V3 specs outline, i.e. only the dtype identifier and the configuration (but see our discussion above on naming and configuration). All other attributes here (e.g. `endianness`, `byte_count`) are runtime information only.

This was a suggestion on my part, and may not actually turn out to be helpful. As you pointed out, it seems likely that this could be handled through a runtime configuration instead. It might be the case that I need to flesh out a complete implementation to see what both options look like, but I think it seems likely that there will need to be some way to keep track of the runtime endianness. And to be clear, in this dtype implementation I'm not proposing to serialise the endianness information to disk.
    def __init_subclass__(  # enforces all required fields are set and basic sanity checks
Implementation detail: I thought it would be helpful to prevent class creation unless all attributes were defined. (See the sketch after this method for an example.)
        cls,
        **kwargs,
    ) -> None:
        required_attrs = [
            "zarr_spec_format",
            "experimental",
            "endianness",
            "byte_count",
            "to_numpy",
        ]
        for attr in required_attrs:
            if not hasattr(cls, attr):
                raise ValueError(f"{attr} is a required attribute for a Zarr dtype.")

        if not hasattr(cls, "configuration_v3"):
            cls.configuration_v3 = None

        cls._zarr_spec_identifier = (
Again, the way I am proposing endianness here is just as an implementation detail for the in-memory representation and for generating this spec identifier.
"big_" + cls.__qualname__.lower() | ||||||
if cls.endianness == "big" | ||||||
else cls.__qualname__.lower() | ||||||
) # how this dtype is identified in core spec; convention is prefix with big_ for big-endian | ||||||
|
||||||
cls._validate() # sanity check on basic requirements | ||||||
|
||||||
super().__init_subclass__(**kwargs) | ||||||
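
A brief sketch of the enforcement described in the comment above; the subclass is hypothetical and the missing
`byte_count` is deliberate:

```python
# Defining a subclass without one of the required attributes fails at class
# creation time, because __init_subclass__ checks for each of them with hasattr.
class BrokenDType(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = None
    to_numpy = np.dtype('B')
    # byte_count is missing -> ValueError: byte_count is a required attribute for a Zarr dtype.
```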
    # TODO: add further checks
    @classmethod
    def _validate(cls):
Just an example of the sort of validation that could happen. (A sketch of a failing case follows the method body below.)
        if cls.byte_count is not None and cls.byte_count <= 0:
            raise ValueError("byte_count must be a positive integer.")

        if cls.byte_count == 1 and cls.endianness is not None:
            raise ValueError("Endianness must be None for single-byte types.")
Mostly the same information as what is included in the PR conversation.