
Prototype of new DType interface #2750

Draft · wants to merge 1 commit into main
Conversation

@nenb commented Jan 22, 2025

This is a PoC for a new standalone interface for managing dtypes in the zarr-python codebase.

Among other things, it tries to provide support for adding dtype extensions, as outlined by the Zarr V3 spec.

I have provided several examples below of how it is currently intended to be used. I will also add a number of comments to the files in this PR to help further clarify intent.

Examples

Core dtype registration

The following example demonstrates how to register a built-in dtype in the core codebase:

import numpy as np

from zarr.core.dtype import ZarrDType
from zarr.registry import register_v3dtype

class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype('float16')

register_v3dtype(Float16)

Entrypoint extension

The following example demonstrates how users can register a new bfloat16 dtype for Zarr. This approach adheres to the existing Zarr entrypoint pattern as much as possible, ensuring consistency with other extensions. The code below would typically be part of a Python package that specifies the entrypoints for the extension:

import ml_dtypes
import numpy as np
from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

class Bfloat16(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype('bfloat16')  # Enabled by importing ml_dtypes
    configuration_v3 = {
        "version": "example_value",
        "author": "example_value",
        "ml_dtypes_version": "example_value"
    }

dtype lookup

The following examples demonstrate how to perform a lookup for the relevant ZarrDType inside the zarr-python codebase, given a string that matches the Zarr specification ID for the dtype, or a numpy dtype object:

from zarr.registry import get_v3dtype_class, get_v3dtype_class_from_numpy

get_v3dtype_class('complex64')  # returns little-endian Complex64 ZarrDType
get_v3dtype_class('not_registered_dtype')  # ValueError

get_v3dtype_class_from_numpy('>i2')  # returns big-endian Int16 ZarrDType
get_v3dtype_class_from_numpy(np.dtype('float32'))  # returns little-endian Float32 ZarrDType
get_v3dtype_class_from_numpy('i10')  # ValueError

String dtypes

The following shows one possibility for supporting variable-length strings, again via the entrypoint mechanism from the previous example. The Apache Arrow specification does not currently include a dtype for fixed-length strings (only for fixed-length bytes), so string here implicitly refers to variable-length string data (there may be some subtleties with codecs that mean this needs further refinement):

import numpy as np
from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

try:
    to_numpy = np.dtypes.StringDType()
except AttributeError:
    to_numpy = np.dtypes.ObjectDType()

class String(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = 'little'
    byte_count = None  # None is defined to mean variable
    to_numpy = to_numpy

int4 dtype

There is currently considerable interest in the AI community in 'quantising' models - storing models at reduced precision while minimising the loss of information content. The community is using a number of sub-byte dtypes for this, e.g. int4. Unfortunately, numpy does not currently have much support for sub-byte dtypes. However, the raw bits/bytes can still be held in a numpy array and then passed (zero-copy) to something like PyTorch, which can handle them more conveniently (see the sketch after this example):

import numpy as np
from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

class Int4(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = 'little'
    byte_count = 1  # this is ugly, but I could change this from byte_count to bit_count if there was consensus 
    to_numpy = np.dtype('B')  # could also be np.dtype('V1'), but this would prevent bit-twiddling
    configuration_v3 = {
        "version": "example_value",
        "author": "example_value",
    }
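
As a concrete illustration of the zero-copy handoff (a hedged sketch: the packing and unpacking shown here are hypothetical, and PyTorch is only one possible consumer):

    import numpy as np
    import torch

    # Two int4 values packed into each uint8 ('B') element, as in Int4.to_numpy.
    packed = np.array([0x12, 0x34], dtype='B')

    tensor = torch.from_numpy(packed)  # zero-copy: shares memory with the numpy array

    # Illustrative nibble unpacking; a real consumer (e.g. a dequantisation
    # kernel) would typically do this on the accelerator.
    low = tensor & 0x0F
    high = (tensor >> 4) & 0x0F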

@@ -0,0 +1,204 @@
"""
Author:

Mostly the same information as what is included in the PR conversation.

import numpy as np


# perhaps over-complicating, but I don't want to allow the attributes to be patched
Author:

Implementation detail: I decided to try to freeze the class attributes after a dtype class has been created. I used a metaclass for this. It's not essential.

)
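
The metaclass body is collapsed in this diff view; a minimal sketch of what a FrozenClassVariables metaclass might look like (an assumption, not necessarily the PR's exact code):

    class FrozenClassVariables(type):
        """Refuse to rebind class attributes once the class has been created."""

        def __setattr__(cls, name: str, value: object) -> None:
            if hasattr(cls, name):
                raise AttributeError(
                    f"Attribute '{name}' of '{cls.__name__}' is frozen and cannot be patched"
                )
            super().__setattr__(name, value)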


class ZarrDType(metaclass=FrozenClassVariables):
Author:

The most important thing here IMO is that ZarrDType should contain all attributes required when introspecting dtypes at runtime.

I would like to replace all statements like np.dtype.kind in ["S", "U"] or np.dtype.itemsize > 0 in the codebase with statements like if ZarrDType.byte_count > 0 etc. Basically, replacing the numpy dtype API with a new zarr-specific API.

I have included the attributes that I currently believe are necessary. Some may be unnecessary, and I may have forgotten others. It's a first attempt!
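
As a concrete (hypothetical) example of the replacement being proposed, using the lookup helper from the examples above:

    import numpy as np
    from zarr.registry import get_v3dtype_class_from_numpy

    dtype = np.dtype('float32')

    # Current style: introspection via the numpy dtype API
    is_fixed_width = dtype.kind not in ("S", "U") and dtype.itemsize > 0

    # Proposed style: the same question asked of the zarr-specific API, which
    # also covers dtypes numpy cannot describe well (bfloat16, int4, ...)
    is_fixed_width = get_v3dtype_class_from_numpy(dtype).byte_count is not None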


zarr_spec_format: Literal["2", "3"] # the version of the zarr spec used
experimental: bool # is this in the core spec or not
endianness: Literal[
Author:

Zarr V3 has made the decision to use a codec for endianness. Endianness is not to be attached to the dtype.
This creates some problems for the Zarr API, which is still linked to numpy's API in a number of ways, including
the ability to create in-memory arrays of arbitrary endianness.

Currently, I think the practical solution is for zarr-python to have dtypes that distinguish between big and little endianness in memory, but always to serialise the little-endian dtype to disk.

I can elaborate on this with examples if helpful, but basically, endianness would just be an implementation detail for zarr-python that would allow it to track the endianness of an object in memory, and it wouldn't actually be used when serialising to disk.

"big", "little", None
] # None indicates not defined i.e. single byte or byte strings
byte_count: int | None # None indicates variable count
to_numpy: np.dtype[
Author:

See the bfloat16 example for how this might require extra packages to be installed.

Any
] # may involve installing a numpy extension e.g. ml_dtypes;

configuration_v3: (
Author:

Wasn't clear to me how this is intended to be used in the spec...

Member:

Basically, dtypes can be represented in the json metadata as a short-hand (str) or as a dict ({"name": str, "configuration": None | {...}}). The configuration key is optional and could be used for dtypes that need additional configuration. If there is no configuration key, the short-hand version is equivalent to the dict with just a name key.

Member:

For example, bfloat16 is equivalent to {"name":"bfloat16"}
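
Concretely, the equivalent forms for the data_type field (illustrative fragments only, with the same placeholder values as the PR examples):

    # Short-hand form in the array metadata:
    {"data_type": "bfloat16"}

    # Equivalent dict form; "configuration" is optional:
    {"data_type": {"name": "bfloat16"}}

    # Dict form carrying a configuration, for dtypes that need one:
    {"data_type": {"name": "bfloat16", "configuration": {"ml_dtypes_version": "example_value"}}}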

Author:

Ah, interesting, thanks!

My current thinking is that every dtype that is not in the core spec should include a configuration key. I would like to introduce a convention where extension dtypes also provide metadata like 'author', 'extension_version', etc., to give the best chance of reproducibility/re-use in the future. At least, until an extension dtype becomes a core dtype.

Is the configuration key an appropriate location for such metadata?

Member:

Assigning names for extensions, such as dtypes, is something that the Zarr core spec should define to coordinate between the different Zarr implementations. However, the spec currently has gaps and does not yet provide clear guidance for doing that. In the Zarr steering council, we are currently evaluating different options that we will propose to the community shortly. Our goal is a naming mechanism that avoids naming conflicts. Our current favorite is to have 2 types of names:

  • URI-based names, e.g. https://nenb.github.io/bfloat16, which can be freely used by anybody who reasonably controls the URI. The URI doesn't need to resolve to anything; it is just a name. However, it makes sense to have some useful information under the URI, e.g. a spec document.
  • Raw names, e.g. bfloat16, which would be assigned through a centralized registry (e.g. a git repo) through a to-be-defined process. This will entail a bit more process than the URI-based names and will come with some expectations w.r.t. backwards compatibility.

Is the configuration key an appropriate location for such metadata?

Not necessarily. I think this information would be better placed in specification documents of the dtypes.


_zarr_spec_identifier: str # implementation detail used to map to core spec

def __init_subclass__( # enforces all required fields are set and basic sanity checks
Author:

Implementation detail: I thought it would be helpful to prevent class creation unless all attributes were defined.

if not hasattr(cls, "configuration_v3"):
    cls.configuration_v3 = None

cls._zarr_spec_identifier = (
Author:

Again, I am proposing endianness purely as an implementation detail for zarr-python to track the endianness of in-memory objects. When serialised to disk, this big_ prefix would always be removed.
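
The surrounding function body is collapsed in this diff view; a minimal sketch of the generation logic being described (assumed details, not necessarily the PR's exact code):

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Enforce that every required field is defined on the subclass.
        for field in ("zarr_spec_format", "experimental", "endianness", "byte_count", "to_numpy"):
            if not hasattr(cls, field):
                raise ValueError(f"{cls.__name__} is missing required field '{field}'")
        if not hasattr(cls, "configuration_v3"):
            cls.configuration_v3 = None
        # Identifier derived from the class name; the big_ prefix tracks in-memory
        # endianness only and would be stripped before serialising to disk.
        prefix = "big_" if cls.endianness == "big" else ""
        cls._zarr_spec_identifier = prefix + cls.__name__.lower()
        cls._validate()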


# TODO: add further checks
@classmethod
def _validate(cls):
Author:

Just an example of the sort of validation that could happen.
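
For instance, a couple of hypothetical checks (assumptions consistent with the examples above, not the PR's actual list):

    @classmethod
    def _validate(cls) -> None:
        # Illustrative checks only; see the TODO above.
        if cls.byte_count is not None and cls.byte_count <= 0:
            raise ValueError(f"{cls.__name__}: byte_count must be positive or None")
        if cls.byte_count is not None and cls.to_numpy.itemsize > cls.byte_count:
            raise ValueError(f"{cls.__name__}: byte_count is smaller than the numpy itemsize")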

@@ -5,11 +5,15 @@
from importlib.metadata import entry_points as get_entry_points
from typing import TYPE_CHECKING, Any, Generic, TypeVar

import numpy as np
Author:

The changes in this module are a bit hacky. They are mainly just to show how something might work - the actual details can certainly be changed quite a bit.

self.register(e.load())
cls = e.load()

if (hasattr(cls, "_zarr_spec_identifier") and hasattr(cls, "to_numpy")):
Author:

The motivation here is that I don't want a user to be able to add a dtype whose numpy representation overrides an already registered dtype e.g. MyFakeBool should not be able to override the core Bool dtype.

The exceptions are for the V and B dtypes, because these are raw byte strings, and multiple dtypes might be mapped to them, e.g. int2 and int4 might both map to numpy's 'B' dtype because numpy has no support for sub-byte dtypes. (A sketch of this guard follows below.)

if class_registry_key is None:
    self[fully_qualified_name(cls)] = cls
else:
    if class_registry_key in self:
Author:

Again, a bit hacky, but I don't want a user to be able to override a core dtype.

__v3_dtype_registry.register(cls, class_registry_key=cls._zarr_spec_identifier)
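
A sketch of the override-protection described in the two comments above (a hypothetical helper; everything except to_numpy is an assumed name):

    def _conflicts_with_registered(registry: dict[str, type[ZarrDType]], cls: type[ZarrDType]) -> bool:
        """True if cls claims a numpy dtype already owned by a registered ZarrDType."""
        if cls.to_numpy.char in ("V", "B"):  # raw byte strings may be shared,
            return False                     # e.g. int2 and int4 both mapping to 'B'
        return any(existing.to_numpy == cls.to_numpy for existing in registry.values())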


def register_v2dtype(cls: type[ZarrDType]) -> None:
Author:

Ideally, we could also backport this to Zarr V2? I'll admit that I haven't really researched this much though. Putting it here just to start a discussion.

Member:

I don't think there is a strong desire in the community to further evolve the v2 spec.

@@ -266,4 +328,50 @@ def get_ndbuffer_class(reload_config: bool = False) -> type[NDBuffer]:
)


# TODO: merge the get_vXdtype_class_ functions
# these can be used instead of the various parse_X functions (hopefully)
def get_v3dtype_class(dtype: str) -> type[ZarrDType]:
Author:

See the examples for how these can be used. As in the comment, the idea is for these to replace the various parse_X functions in the codebase.
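
A hedged sketch of how these lookups might work (assuming the __v3_dtype_registry mapping above and a little-endian host; the matching rules are guesses, not the PR's exact code):

    def get_v3dtype_class(dtype: str) -> type[ZarrDType]:
        try:
            return __v3_dtype_registry[dtype]
        except KeyError:
            raise ValueError(f"No v3 dtype registered under '{dtype}'") from None

    def get_v3dtype_class_from_numpy(dtype: np.dtype | str) -> type[ZarrDType]:
        try:
            np_dtype = np.dtype(dtype)
        except TypeError as e:
            raise ValueError(f"Invalid numpy dtype spec: {dtype!r}") from e  # e.g. 'i10'
        endianness = "big" if np_dtype.byteorder == ">" else "little"
        for cls in __v3_dtype_registry.values():
            if cls.to_numpy == np_dtype.newbyteorder("=") and cls.endianness in (endianness, None):
                return cls
        raise ValueError(f"numpy dtype {np_dtype} has no registered ZarrDType")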

@d-v-b (Contributor) commented Jan 23, 2025

@nenb thank you so much for working on this. I really like the ideas in here -- it looks like a great addition to the library. I will try to give this a close examination, and I recommend the other @zarr-developers/python-core-devs do the same. This is a big, forward-looking change, so I imagine we will have a lot of opinions on multiple scales here.

Our weekly dev meeting is tomorrow (here's the calendar info; the particular meeting I'm referring to is the zarr-python developers meeting). If the timing works for you, would it be possible for you to attend and give an overview of this PR?


class ZarrDType(metaclass=FrozenClassVariables):

zarr_spec_format: Literal["2", "3"] # the version of the zarr spec used
Member:

Suggested change:
- zarr_spec_format: Literal["2", "3"] # the version of the zarr spec used
+ zarr_format: Literal["2", "3"] # the version of the zarr spec used

dict | None
) # TODO: understand better how this is recommended by the spec

_zarr_spec_identifier: str # implementation detail used to map to core spec
Member:

I guess this would be mandatory? Or, how would the identifier for the metadata be specified otherwise?

Author (Jan 23, 2025):

TL;DR
This isn't an essential detail, and the implementation can probably be considerably improved if there is consensus around adding a new dtype interface. I've included some details below about how I ended up with this attribute if you are interested though.

Details
At the moment, this is generated from the class name (see the logic in __init_subclass__). The user doesn't specify it; rather, the user specifies the class name, and the identifier is generated from the class name (potentially prefixed with big_).

Example

class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "big"
    byte_count = 2
    to_numpy = np.dtype('float16')

This would generate big_float16 for _zarr_spec_identifier.


It's probably a bit clumsy in its current implementation. I ended up with this pattern i) as a way of tracking the in-memory endianness of the dtype and ii) to make sure the class name stays consistent with the identifier - the class name pops up a few times in the entrypoint registration logic, and I didn't want a situation where the class name could have a different value from the spec identifier.

Obviously, the convention is that a class needs to be named according to how the dtype is identified in the spec.

Member:

As I wrote in my other comment, we generally want to follow some guidelines around how and what metadata goes into the zarr.json files to achieve interoperability with other libraries/languages that implement Zarr. Unfortunately, some of these guidelines still need to be specced out.

In any case, maybe I missed it, why do we need to persist the endianness as part of the dtype? Shouldn't that be handled by the appropriate array-to-bytes codec (e.g. bytes)? The runtime endianness might need to be handled through runtime configuration, see the ArrayConfig class.

Author:

The confusion is my fault - I should probably have made it more clear what metadata I was proposing to serialise to disk.

I am proposing to serialise metadata to disk exactly as the current V3 spec outlines, i.e. only the dtype identifier and the configuration object.

(But see our discussion above on configuration, I am still learning how it is intended to be used.)

All other attributes here (endianness, experimental, etc.) are implementation details to express intent to zarr-python at runtime. This is already done via the numpy dtype API in zarr-python, e.g. statements like np.dtype.kind in ["U", "S"]. But this numpy API has limitations, e.g. it can't recognise new dtypes like bfloat16 correctly, which is a large part of why I am proposing that zarr-python have its own dtype interface in this PoC.

In any case, maybe I missed it, why do we need to persist the endianness as part of the dtype?

This was a suggestion on my part, and may not actually turn out to be helpful. A lot of zarr code does things like dtype='>i2' - a good example from the current issues. There will need to be a way of tracking this runtime endianness in Zarr V3.

As you pointed out, it seems likely that this could be handled through a runtime configuration (ArrayConfig). But it felt more natural (to me) to keep track of this information on the dtype itself.

It might be the case that I need to flesh out a complete implementation to try and see what both options look like. But I think it seems likely that there will need to be some way to keep track of the runtime endianness.

And to be clear, in this dtype implementation, I'm not proposing to serialise the information in the endianness attribute to disk in the zarr metadata. It would purely be an implementation detail that zarr-python would use to keep track of runtime endianness.
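
For instance (using the lookup from the PR description's examples; the attribute values shown are what this proposal would produce, as an illustration rather than a spec):

    import numpy as np
    from zarr.registry import get_v3dtype_class_from_numpy

    # User code that pins a byte order, as in dtype='>i2':
    zdtype = get_v3dtype_class_from_numpy(np.dtype('>i2'))

    zdtype.endianness             # "big"       - tracked at runtime only
    zdtype._zarr_spec_identifier  # "big_int16" - in-memory identifier
    # The on-disk metadata would record plain "int16"; the byte order itself
    # is handled by the array-to-bytes codec, per the V3 spec.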

@nenb (Author) commented Jan 23, 2025

@nenb thank you so much for working on this. I really like the ideas in here -- it looks like a great addition to the library. I will try to give this a close examination, and I recommend the other @zarr-developers/python-core-devs do the same. This is a big, forward-looking change, so I imagine we will have a lot of opinions on multiple scales here.

Our weekly dev meeting is tomorrow (here's the calendar info; the particular meeting I'm referring to is the zarr-python developers meeting). If the timing works for you, would it be possible for you to attend and give an overview of this PR?

@d-v-b Thanks for being so receptive! I'm happy to attend and give an overview.

What's the best format for presenting at the meeting? (I haven't attended before.) If I give a 5-minute overview (including some motivation from the AI side of things) and then leave 10 minutes for questions, would that work? Is there anything specific that you think is worth focusing on, e.g. the extension mechanism?

@d-v-b (Contributor) commented Jan 23, 2025

@nenb that plan sounds good. I'm looking forward to it!
