Prototype of new DType interface #2750
base: main
Conversation
@@ -0,0 +1,204 @@
"""
Mostly the same information as what is included in the PR conversation.
import numpy as np

# perhaps over-complicating, but I don't want to allow the attributes to be patched
Implementation detail: I decided to try to freeze the class attributes after a dtype has been created. I used metaclasses for this. It's not essential.
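A minimal sketch of what such a metaclass could look like (the actual implementation in this PR may differ):

```python
class FrozenClassVariables(type):
    """Sketch: forbid re-assigning class attributes once they are set."""

    def __setattr__(cls, name: str, value: object) -> None:
        if hasattr(cls, name):
            raise AttributeError(
                f"Cannot patch {name!r} after {cls.__name__} has been created"
            )
        super().__setattr__(name, value)
```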
)
class ZarrDType(metaclass=FrozenClassVariables): |
The most important thing here IMO is that ZarrDType should contain all attributes required when introspecting dtypes at runtime. I would like to replace all statements like np.dtype.kind in ["S", "U"] or np.dtype.itemsize > 0 in the codebase with statements like if ZarrDType.byte_count > 0 etc. Basically, replacing the numpy dtype API with a new zarr-specific API.
I have included the attributes that I currently believe are necessary. But some may be unnecessary, and I may have forgotten others. It's a first attempt!
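As an illustration of the intended style (hypothetical helper, not part of this PR):

```python
def is_fixed_width(dtype: type[ZarrDType]) -> bool:
    # replaces numpy-API checks such as np.dtype(...).itemsize > 0;
    # byte_count is None marks variable-width dtypes
    return dtype.byte_count is not None
```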
zarr_spec_format: Literal["2", "3"]  # the version of the zarr spec used
experimental: bool  # is this in the core spec or not
endianness: Literal[
Zarr V3 has made the decision to use a codec for endianness. Endianness is not to be attached to the dtype.
This creates some problems for the Zarr API, which is still linked to numpy's API in a number of ways, including the ability to create in-memory arrays of arbitrary endianness.
Currently, I think that the practical solution is for zarr-python to have dtypes that distinguish between big and little endianness in memory, but that when serialised to disk, always serialise the little endian dtype.
I can elaborate on this with examples if helpful, but basically, endianness would just be an implementation detail for zarr-python that would allow it to track the endianness of an object in memory, and it wouldn't actually be used when serialising to disk.
"big", "little", None | ||
] # None indicates not defined i.e. single byte or byte strings | ||
byte_count: int | None # None indicates variable count | ||
to_numpy: np.dtype[ |
See the bfloat16 example about how this might require new packages to be installed.
    Any
]  # may involve installing a numpy extension e.g. ml_dtypes;
configuration_v3: ( |
Wasn't clear to me how this is intended to be used in the spec...
Basically, dtypes can be represented in the json metadata as a short-hand (str) or a dict ({ "name": str, "configuration": None | { ... } }). The configuration key is optional and could be used for dtypes that need additional configuration. If there is no configuration key, the short-hand version is equivalent to the dict with just a name key.
For example, bfloat16 is equivalent to {"name": "bfloat16"}.
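Sketched as Python dicts (the configuration contents are purely illustrative):

```python
short_hand = "bfloat16"            # short-hand form
dict_form = {"name": "bfloat16"}   # equivalent dict form, no configuration needed

# a dtype needing extra parameters could carry them under "configuration":
configured = {"name": "bfloat16", "configuration": {"author": "example"}}
```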
Ah, interesting, thanks!
My current thinking is that every dtype that is not in the core spec should include a configuration key. I would like to introduce a convention where extension dtypes also provide metadata like 'author', 'extension_version', etc., to give the best chance of reproducibility/re-use in the future. At least, until an extension dtype becomes a core dtype.
Is the configuration key an appropriate location for such metadata?
Assigning names for extensions, such as dtypes, is something that the Zarr core spec should define to coordinate between the different Zarr implementations. However, there are still some gaps in the spec where it does not yet provide clear guidance for doing that. In the Zarr steering council, we are currently evaluating different options that we will propose to the community shortly. Our goal is to achieve a naming mechanism that avoids naming conflicts. Our current favorite is to have 2 types of names:
- URI-based names, e.g. https://nenb.github.io/bfloat16, which can be freely used by anybody who reasonably controls the URI. The URI doesn't need to resolve to anything; it is just a name. However, it makes sense to have some useful information under the URI, e.g. a spec document.
- Raw names, e.g. bfloat16, which would be assigned through a centralized registry (e.g. a git repo) through a to-be-defined process. This will entail a bit more process than the URI-based names and will come with some expectations w.r.t. backwards compatibility.
> Is the configuration key an appropriate location for such metadata?

Not necessarily. I think this information would be better placed in specification documents of the dtypes.
_zarr_spec_identifier: str  # implementation detail used to map to core spec
def __init_subclass__( # enforces all required fields are set and basic sanity checks |
Implementation detail: I thought it would be helpful to prevent class creation unless all attributes were defined.
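A rough sketch of the check (the required-field list follows the attributes above; the real logic in this PR may differ):

```python
def __init_subclass__(cls, **kwargs) -> None:
    super().__init_subclass__(**kwargs)
    required = ("zarr_spec_format", "experimental", "endianness", "byte_count", "to_numpy")
    missing = [field for field in required if not hasattr(cls, field)]
    if missing:
        raise ValueError(f"{cls.__name__} is missing required attributes: {missing}")
```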
if not hasattr(cls, "configuration_v3"):
    cls.configuration_v3 = None
cls._zarr_spec_identifier = ( |
Again, the way I am proposing endianness is just as an implementation detail for zarr-python to track the endianness of in-memory objects. When serialised to disk, this big_ prefix would always be removed.
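Something along these lines (sketch; the helper name is mine):

```python
def identifier_for_metadata(cls: type[ZarrDType]) -> str:
    # "big_float16" and "float16" both serialise as "float16": the prefix
    # only tracks in-memory endianness and never reaches the metadata
    return cls._zarr_spec_identifier.removeprefix("big_")
```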
# TODO: add further checks
@classmethod
def _validate(cls):
Just an example of the sort of validation that could happen.
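For instance, a check of this kind (hypothetical):

```python
@classmethod
def _validate(cls) -> None:
    # multi-byte fixed-width dtypes should declare their endianness;
    # single-byte dtypes use endianness=None
    if cls.byte_count is not None and cls.byte_count > 1 and cls.endianness is None:
        raise ValueError(f"{cls.__name__}: multi-byte dtypes must declare endianness")
```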
@@ -5,11 +5,15 @@
from importlib.metadata import entry_points as get_entry_points
from typing import TYPE_CHECKING, Any, Generic, TypeVar

import numpy as np
The changes in this module are a bit hacky. They are mainly just to show how something might work - the actual details can certainly be changed quite a bit.
self.register(e.load())
cls = e.load()
if (hasattr(cls, "_zarr_spec_identifier") and hasattr(cls, "to_numpy")): |
The motivation here is that I don't want a user to be able to add a dtype whose numpy representation overrides an already registered dtype, e.g. MyFakeBool should not be able to override the core Bool dtype.
The exceptions are for the V and B dtypes, because these are raw byte strings, and multiple dtypes might be mapped to this, e.g. int2 and int4 might both be mapped to the 'B' dtype in numpy because numpy has no support for sub-byte dtypes.
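A sketch of that guard ('B' is numpy's uint8 character code and 'V' is void/raw bytes; the exact check in the PR may differ):

```python
# refuse registration when the numpy representation collides with an
# already-registered dtype, unless it is a raw-byte carrier ("V" or "B")
if cls.to_numpy.char not in ("V", "B"):
    for registered in self.values():
        if getattr(registered, "to_numpy", None) == cls.to_numpy:
            raise ValueError(f"{cls.__name__} would override an existing dtype")
```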
if class_registry_key is None:
    self[fully_qualified_name(cls)] = cls
else:
    if class_registry_key in self:
Again, a bit hacky, but I don't want a user to be able to override a core dtype.
__v3_dtype_registry.register(cls, class_registry_key=cls._zarr_spec_identifier)

def register_v2dtype(cls: type[ZarrDType]) -> None: |
Ideally, we could also backport this to Zarr V2. I'll admit that I haven't really researched this much though. Putting it here just to start a discussion.
I don't think there is a strong desire in the community to further evolve the v2 spec.
@@ -266,4 +328,50 @@ def get_ndbuffer_class(reload_config: bool = False) -> type[NDBuffer]:
)


# TODO: merge the get_vXdtype_class_ functions
# these can be used instead of the various parse_X functions (hopefully)
def get_v3dtype_class(dtype: str) -> type[ZarrDType]:
See the examples for how these can be used. As in the comment, the idea is for these to replace the various parse_X functions in the codebase.
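Hypothetical usage (the module path and helper name follow this PR's registry changes):

```python
import numpy as np

from zarr.registry import get_v3dtype_class  # added in this PR

float16_cls = get_v3dtype_class("float16")   # lookup by Zarr spec identifier
assert float16_cls.to_numpy == np.dtype("float16")
```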
@nenb thank you so much for working on this. I really like the ideas in here - it looks like a great addition to the library. I will try to give this close examination, and I recommend the other @zarr-developers/python-core-devs do the same. This is a big, forward-looking change, so I imagine we will have a lot of opinions on multiple scales here. Our weekly dev meeting is tomorrow (here's the calendar info; the particular meeting I'm referring to is the zarr-python developers meeting). If the timing works for you, would it be possible for you to attend and give an overview of this PR?
class ZarrDType(metaclass=FrozenClassVariables):
zarr_spec_format: Literal["2", "3"] # the version of the zarr spec used |
Suggested change:
- zarr_spec_format: Literal["2", "3"] # the version of the zarr spec used
+ zarr_format: Literal["2", "3"] # the version of the zarr spec used
    dict | None
)  # TODO: understand better how this is recommended by the spec
_zarr_spec_identifier: str # implementation detail used to map to core spec |
I guess this would be mandatory? Or, how would the identifier for the metadata be specified otherwise?
TL;DR
This isn't an essential detail, and the implementation can probably be considerably improved if there is consensus around adding a new dtype interface. I've included some details below about how I ended up with this attribute, if you are interested though.

Details
At the moment, this is generated from the class name (see the logic in __init_subclass__). The user doesn't specify it; rather, the user specifies the class name, and the identifier is generated from the class name (potentially prefixed with big_).
Example

```python
class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "big"
    byte_count = 2
    to_numpy = np.dtype('float16')
```

This would generate big_float16 for _zarr_spec_identifier.
It's probably a bit clumsy in its current implementation. I ended up with this pattern i) as a way of tracking the in-memory endianness of the dtype and ii) as a way of making sure the class name stays consistent with the identifier - the class name pops up a few times in the entrypoint registration logic, and I didn't want a situation where a class name could have a different value to the spec identifier.
Obviously, the convention is that a class needs to be named according to how the dtype is identified in the spec.
As I wrote in my other comment, we generally want to follow some guidelines around how and what metadata goes into the zarr.json files to achieve interoperability with other libraries/languages that implement Zarr. Unfortunately, some of these guidelines still need to be specced out.
In any case, maybe I missed it, why do we need to persist the endianness as part of the dtype? Shouldn't that be handled by the appropriate array-to-bytes codec (e.g. bytes)? The runtime endianness might need to be handled through runtime configuration, see the ArrayConfig class.
The confusion is my fault - I should probably have made it more clear what metadata I was proposing to serialise to disk.
I am proposing to serialise metadata to disk exactly as the current V3 specs outline, i.e. only the dtype identifier and the configuration object.
(But see our discussion above on configuration, I am still learning how it is intended to be used.)
All other attributes here (endianness, experimental, etc.) are implementation details to express intent to zarr-python at runtime. This is already done by the numpy dtype API in zarr-python, e.g. the use of statements like np.dtype.kind in ["U", "S"]. But this numpy API has limitations, e.g. it can't recognise new dtypes like bfloat16 correctly. Which is a large reason why I am proposing that zarr-python have its own dtype interface in this PoC.

> In any case, maybe I missed it, why do we need to persist the endianness as part of the dtype?

This was a suggestion on my part, and may not actually turn out to be helpful. A lot of zarr code does things like dtype='>i2' - a good example from the current issues. There will need to be a way of tracking this runtime endianness in Zarr V3.
As you pointed out, it seems likely that this could be handled through a runtime configuration (ArrayConfig). But it felt more natural (to me) to keep track of this information on the dtype itself.
It might be the case that I need to flesh out a complete implementation to try and see what both options look like. But I think it seems likely that there will need to be some way to keep track of the runtime endianness.
And to be clear: in this dtype implementation, I'm not proposing to serialise the information in the endianness attribute to disk in the zarr metadata. It would purely be an implementation detail that zarr-python would use to keep track of runtime endianness.
@d-v-b Thanks for being so receptive! I'm happy to attend and give an overview. What's the best format to present at the meeting (I haven't attended before)? If I give a 5 minute overview (including some motivation from the AI side of things) and then leave 10 minutes for questions, would that work? Is there anything specific that you think is worth focusing on, e.g. the extension mechanism?
@nenb that plan sounds good. I'm looking forward to it!
This is a PoC for a new standalone interface for managing dtypes in the zarr-python codebase. Among other things, it tries to provide support for adding dtype extensions, as outlined by the Zarr V3 spec.
I have provided several examples below of how it is currently intended to be used. I will also add a number of comments to the files in this PR to help further clarify intent.
Examples
Core dtype registration
The following example demonstrates how to register a built-in dtype in the core codebase:
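A sketch based on the Float16 example from the review discussion (the module path and registration helper name are assumptions):

```python
import numpy as np

from zarr.core.dtype import ZarrDType        # module path assumed
from zarr.registry import register_v3dtype   # helper name assumed


class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype("float16")


register_v3dtype(Float16)
```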
Entrypoint extension
The following example demonstrates how users can register a new bfloat16 dtype for Zarr. This approach adheres to the existing Zarr entrypoint pattern as much as possible, ensuring consistency with other extensions. The code below would typically be part of a Python package that specifies the entrypoints for the extension:
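A sketch of such a package's dtype module (module paths are assumptions; ml_dtypes supplies the numpy extension type):

```python
# mypackage/bfloat16.py - hypothetical extension package
import ml_dtypes
import numpy as np

from zarr.core.dtype import ZarrDType  # module path assumed


class Bfloat16(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype(ml_dtypes.bfloat16)  # numpy extension type from ml_dtypes
    configuration_v3 = {
        "version": "example_value",
        "author": "example_value",
        "ml_dtypes_version": "example_value",
    }
```

The package would then declare this class under the appropriate zarr entry-point group in its pyproject.toml (the exact group name for dtypes is still to be settled).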
dtype lookup
The following examples demonstrate how to perform a lookup for the relevant ZarrDType inside the zarr-python codebase, given a string that matches the Zarr specification ID for the dtype, or a numpy dtype object:
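A sketch (the numpy-based lookup helper is an assumption, suggested by the get_vXdtype_class_ TODO in the registry module):

```python
import numpy as np

from zarr.registry import (  # names assumed from this PR's registry additions
    get_v3dtype_class,
    get_v3dtype_class_from_numpy,
)

# lookup by the Zarr specification ID...
float16_cls = get_v3dtype_class("float16")

# ...or by a numpy dtype object
float16_cls = get_v3dtype_class_from_numpy(np.dtype("float16"))
```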
String dtypes
The following indicates one possibility for supporting variable-length strings. It is via the entrypoint mechanism as in a previous example. The Apache Arrow specification does not currently include a dtype for fixed-length strings (only for fixed-length bytes), and so I am using string here to implicitly refer to variable-length string data (there may be some subtleties with codecs that mean this needs to be refined further):
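One possible sketch (module path assumed; byte_count = None marks the variable-length case):

```python
import numpy as np

from zarr.core.dtype import ZarrDType  # module path assumed


class String(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = None    # not meaningful for variable-length data
    byte_count = None    # None indicates variable length
    to_numpy = np.dtype("O")  # object dtype holding Python str
```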
int4 dtype
There is currently considerable interest in the AI community in 'quantising' models - storing models at reduced precision, while minimising loss of information content. There are a number of sub-byte dtypes that the community is using, e.g. int4. Unfortunately, numpy does not currently have much support for sub-byte dtypes. However, raw bits/bytes can still be held in a numpy array and then passed (in a zero-copy way) to something like pytorch, which can handle them more conveniently:
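A sketch of the idea (module path assumed; the byte_count convention for a sub-byte dtype is a guess):

```python
import numpy as np
import torch

from zarr.core.dtype import ZarrDType  # module path assumed


class Int4(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = None
    byte_count = None         # sub-byte width; exact convention to be decided
    to_numpy = np.dtype("B")  # raw bytes - numpy has no sub-byte dtypes


# two int4 values packed per byte, held as uint8 and viewed zero-copy in pytorch
packed = np.frombuffer(bytearray([0x12, 0x34]), dtype="B")
tensor = torch.from_numpy(packed)
low = tensor & 0x0F           # lower nibble of each byte
high = (tensor >> 4) & 0x0F   # upper nibble
```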