
Prototype of new DType interface #2750

Draft · wants to merge 1 commit into main
Conversation

@nenb commented Jan 22, 2025

This is a PoC for a new standalone interface for managing dtypes in the zarr-python codebase.

Among other things, it tries to provide support for adding dtype extensions, as outlined by the Zarr V3 spec.

I have provided several examples below of how it is currently intended to be used. I will also add a number of comments to the files in this PR to help further clarify intent.

Examples

Core dtype registration

The following example demonstrates how to register a built-in dtype in the core codebase:

import numpy as np

from zarr.core.dtype import ZarrDType
from zarr.registry import register_v3dtype

class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype('float16')

register_v3dtype(Float16)

Entrypoint extension

The following example demonstrates how users can register a new bfloat16 dtype for Zarr. This approach adheres to the existing Zarr entrypoint pattern as much as possible, ensuring consistency with other extensions. The code below would typically be part of a Python package that specifies the entrypoints for the extension:

import ml_dtypes
import numpy as np
from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

class Bfloat16(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype('bfloat16')  # Enabled by importing ml_dtypes
    configuration_v3 = {
        "version": "example_value",
        "author": "example_value",
        "ml_dtypes_version": "example_value"
    }

dtype lookup

The following examples demonstrate how to perform a lookup for the relevant ZarrDType inside the zarr-python codebase, given a string that matches the Zarr specification ID for the dtype, or a numpy dtype object:

from zarr.registry import get_v3dtype_class, get_v3dtype_class_from_numpy

get_v3dtype_class('complex64')  # returns little-endian Complex64 ZarrDType
get_v3dtype_class('not_registered_dtype')  # ValueError

get_v3dtype_class_from_numpy('>i2')  # returns big-endian Int16 ZarrDType
get_v3dtype_class_from_numpy(np.dtype('float32'))  # returns little-endian Float32 ZarrDType
get_v3dtype_class_from_numpy('i10')  # ValueError

String dtypes

The following shows one possibility for supporting variable-length strings, again via the entrypoint mechanism from the previous example. The Apache Arrow specification does not currently include a dtype for fixed-length strings (only for fixed-length bytes), so string here implicitly refers to variable-length string data (there may be some subtleties with codecs that mean this needs further refinement):

import numpy as np
from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

try:
    to_numpy = np.dtypes.StringDType()
except AttributeError:
    to_numpy = np.dtypes.ObjectDType()

class String(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = 'little'
    byte_count = None  # None is defined to mean variable
    to_numpy = to_numpy

int4 dtype

There is currently considerable interest in the AI community in 'quantising' models - storing models at reduced precision while minimising the loss of information content. The community is using a number of sub-byte dtypes for this, e.g. int4. Unfortunately, numpy does not currently have much support for sub-byte dtypes. However, the raw bits/bytes can still be held in a numpy array and then passed (zero-copy) to something like PyTorch, which can handle them more conveniently (see the sketch after this example):

import numpy as np
from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

class Int4(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = 'little'
    byte_count = 1  # this is ugly, but I could change this from byte_count to bit_count if there was consensus 
    to_numpy = np.dtype('B')  # could also be np.dtype('V1'), but this would prevent bit-twiddling
    configuration_v3 = {
        "version": "example_value",
        "author": "example_value",
    }
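
As a concrete illustration of the zero-copy handoff (a hedged sketch: the packing and unpacking shown here are hypothetical, and PyTorch is only one possible consumer):

    import numpy as np
    import torch

    # Two int4 values packed into each uint8 ('B') element, as in Int4.to_numpy.
    packed = np.array([0x12, 0x34], dtype='B')

    tensor = torch.from_numpy(packed)  # zero-copy: shares memory with the numpy array

    # Illustrative nibble unpacking; a real consumer (e.g. a dequantisation
    # kernel) would typically do this on the accelerator.
    low = tensor & 0x0F
    high = (tensor >> 4) & 0x0F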

@@ -0,0 +1,204 @@
"""
Author:

Mostly the same information as what is included in the PR conversation.

import numpy as np


# perhaps over-complicating, but I don't want to allow the attributes to be patched
Author:

Implementation detail: I decided to try to freeze the class attributes after a dtype class has been created. I used a metaclass for this. It's not essential.

)
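
The metaclass body is collapsed in this diff view; a minimal sketch of what a FrozenClassVariables metaclass might look like (an assumption, not necessarily the PR's exact code):

    class FrozenClassVariables(type):
        """Refuse to rebind class attributes once the class has been created."""

        def __setattr__(cls, name: str, value: object) -> None:
            if hasattr(cls, name):
                raise AttributeError(
                    f"Attribute '{name}' of '{cls.__name__}' is frozen and cannot be patched"
                )
            super().__setattr__(name, value)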


class ZarrDType(metaclass=FrozenClassVariables):
Author:

The most important thing here IMO is that ZarrDType should contain all attributes required when introspecting dtypes at runtime.

I would like to replace all statements like np.dtype.kind in ["S", "U"] or np.dtype.itemsize > 0 in the codebase with statements like if ZarrDType.byte_count > 0 etc. Basically, replacing the numpy dtype API with a new zarr-specific API.

I have included the attributes that I currently believe are necessary. Some may be unnecessary, and I may have forgotten others. It's a first attempt!
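
As a concrete (hypothetical) example of the replacement being proposed, using the lookup helper from the examples above:

    import numpy as np
    from zarr.registry import get_v3dtype_class_from_numpy

    dtype = np.dtype('float32')

    # Current style: introspection via the numpy dtype API
    is_fixed_width = dtype.kind not in ("S", "U") and dtype.itemsize > 0

    # Proposed style: the same question asked of the zarr-specific API, which
    # also covers dtypes numpy cannot describe well (bfloat16, int4, ...)
    is_fixed_width = get_v3dtype_class_from_numpy(dtype).byte_count is not None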


zarr_spec_format: Literal["2", "3"] # the version of the zarr spec used
experimental: bool # is this in the core spec or not
endianness: Literal[
Author:

Zarr V3 has made the decision to use a codec for endianness. Endianness is not to be attached to the dtype.
This creates some problems for the Zarr API, which is still linked to numpy's API in a number of ways, including
the ability to create in-memory arrays of arbitrary endianness.

Currently, I think the practical solution is for zarr-python to have dtypes that distinguish between big and little endianness in memory, but always to serialise the little-endian dtype to disk.

I can elaborate on this with examples if helpful, but basically, endianness would just be an implementation detail for zarr-python that would allow it to track the endianness of an object in memory, and it wouldn't actually be used when serialising to disk.

"big", "little", None
] # None indicates not defined i.e. single byte or byte strings
byte_count: int | None # None indicates variable count
to_numpy: np.dtype[
Author:

See the bfloat16 example for how this might require extra packages to be installed.

Any
] # may involve installing a numpy extension e.g. ml_dtypes;

configuration_v3: (
Author:

Wasn't clear to me how this is intended to be used in the spec...

Member:

Basically, dtypes can be represented in the json metadata as a short-hand (str) or as a dict ({"name": str, "configuration": None | {...}}). The configuration key is optional and could be used for dtypes that need additional configuration. If there is no configuration key, the short-hand version is equivalent to the dict with just a name key.

Member:

For example, bfloat16 is equivalent to {"name":"bfloat16"}
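
Concretely, the equivalent forms for the data_type field (illustrative fragments only, with the same placeholder values as the PR examples):

    # Short-hand form in the array metadata:
    {"data_type": "bfloat16"}

    # Equivalent dict form; "configuration" is optional:
    {"data_type": {"name": "bfloat16"}}

    # Dict form carrying a configuration, for dtypes that need one:
    {"data_type": {"name": "bfloat16", "configuration": {"ml_dtypes_version": "example_value"}}}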

Author:

Ah, interesting, thanks!

My current thinking is that every dtype that is not in the core spec should include a configuration key. I would like to introduce a convention where extension dtypes also provide metadata like 'author', 'extension_version', etc., to give the best chance of reproducibility/re-use in the future. At least, until an extension dtype becomes a core dtype.

Is the configuration key an appropriate location for such metadata?

Member:

Assigning names for extensions, such as dtypes, is something that the Zarr core spec should define to coordinate between the different Zarr implementations. However, the spec currently has gaps and does not yet provide clear guidance for doing that. In the Zarr steering council, we are currently evaluating different options that we will propose to the community shortly. Our goal is a naming mechanism that avoids naming conflicts. Our current favorite is to have 2 types of names:

  • URI-based names, e.g. https://nenb.github.io/bfloat16, which can be freely used by anybody who reasonably controls the URI. The URI doesn't need to resolve to anything; it is just a name. However, it makes sense to have some useful information under the URI, e.g. a spec document.
  • Raw names, e.g. bfloat16, which would be assigned through a centralized registry (e.g. a git repo) through a to-be-defined process. This will entail a bit more process than the URI-based names and will come with some expectations w.r.t. backwards compatibility.

Is the configuration key an appropriate location for such metadata?

Not necessarily. I think this information would be better placed in specification documents of the dtypes.


_zarr_spec_identifier: str # implementation detail used to map to core spec

def __init_subclass__( # enforces all required fields are set and basic sanity checks
Author:

Implementation detail: I thought it would be helpful to prevent class creation unless all attributes were defined.

if not hasattr(cls, "configuration_v3"):
    cls.configuration_v3 = None

cls._zarr_spec_identifier = (
Author:

Again, I am proposing endianness purely as an implementation detail for zarr-python to track the endianness of in-memory objects. When serialised to disk, this big_ prefix would always be removed.
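
The surrounding function body is collapsed in this diff view; a minimal sketch of the generation logic being described (assumed details, not necessarily the PR's exact code):

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Enforce that every required field is defined on the subclass.
        for field in ("zarr_spec_format", "experimental", "endianness", "byte_count", "to_numpy"):
            if not hasattr(cls, field):
                raise ValueError(f"{cls.__name__} is missing required field '{field}'")
        if not hasattr(cls, "configuration_v3"):
            cls.configuration_v3 = None
        # Identifier derived from the class name; the big_ prefix tracks in-memory
        # endianness only and would be stripped before serialising to disk.
        prefix = "big_" if cls.endianness == "big" else ""
        cls._zarr_spec_identifier = prefix + cls.__name__.lower()
        cls._validate()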


# TODO: add further checks
@classmethod
def _validate(cls):
Author:

Just an example of the sort of validation that could happen.
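
For instance, a couple of hypothetical checks (assumptions consistent with the examples above, not the PR's actual list):

    @classmethod
    def _validate(cls) -> None:
        # Illustrative checks only; see the TODO above.
        if cls.byte_count is not None and cls.byte_count <= 0:
            raise ValueError(f"{cls.__name__}: byte_count must be positive or None")
        if cls.byte_count is not None and cls.to_numpy.itemsize > cls.byte_count:
            raise ValueError(f"{cls.__name__}: byte_count is smaller than the numpy itemsize")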

@@ -5,11 +5,15 @@
from importlib.metadata import entry_points as get_entry_points
from typing import TYPE_CHECKING, Any, Generic, TypeVar

import numpy as np
Author:

The changes in this module are a bit hacky. They are mainly just to show how something might work - the actual details can certainly be changed quite a bit.

self.register(e.load())
cls = e.load()

if (hasattr(cls, "_zarr_spec_identifier") and hasattr(cls, "to_numpy")):
Author:

The motivation here is that I don't want a user to be able to add a dtype whose numpy representation overrides an already registered dtype e.g. MyFakeBool should not be able to override the core Bool dtype.

The exceptions are for the V and B dtypes, because these are raw byte strings, and multiple dtypes might be mapped to them, e.g. int2 and int4 might both map to numpy's 'B' dtype because numpy has no support for sub-byte dtypes. (A sketch of this guard follows below.)

if class_registry_key is None:
    self[fully_qualified_name(cls)] = cls
else:
    if class_registry_key in self:
Author:

Again, a bit hacky, but I don't want a user to be able to override a core dtype.

__v3_dtype_registry.register(cls, class_registry_key=cls._zarr_spec_identifier)
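
A sketch of the override-protection described in the two comments above (a hypothetical helper; everything except to_numpy is an assumed name):

    def _conflicts_with_registered(registry: dict[str, type[ZarrDType]], cls: type[ZarrDType]) -> bool:
        """True if cls claims a numpy dtype already owned by a registered ZarrDType."""
        if cls.to_numpy.char in ("V", "B"):  # raw byte strings may be shared,
            return False                     # e.g. int2 and int4 both mapping to 'B'
        return any(existing.to_numpy == cls.to_numpy for existing in registry.values())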


def register_v2dtype(cls: type[ZarrDType]) -> None:
Author:

Ideally, we could also backport this to Zarr V2? I'll admit that I haven't really researched this much though. Putting it here just to start a discussion.

Member:

I don't think there is a strong desire in the community to further evolve the v2 spec.

@@ -266,4 +328,50 @@ def get_ndbuffer_class(reload_config: bool = False) -> type[NDBuffer]:
)


# TODO: merge the get_vXdtype_class_ functions
# these can be used instead of the various parse_X functions (hopefully)
def get_v3dtype_class(dtype: str) -> type[ZarrDType]:
Author:

See the examples for how these can be used. As in the comment, the idea is for these to replace the various parse_X functions in the codebase.
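
A hedged sketch of how these lookups might work (assuming the __v3_dtype_registry mapping above and a little-endian host; the matching rules are guesses, not the PR's exact code):

    def get_v3dtype_class(dtype: str) -> type[ZarrDType]:
        try:
            return __v3_dtype_registry[dtype]
        except KeyError:
            raise ValueError(f"No v3 dtype registered under '{dtype}'") from None

    def get_v3dtype_class_from_numpy(dtype: np.dtype | str) -> type[ZarrDType]:
        try:
            np_dtype = np.dtype(dtype)
        except TypeError as e:
            raise ValueError(f"Invalid numpy dtype spec: {dtype!r}") from e  # e.g. 'i10'
        endianness = "big" if np_dtype.byteorder == ">" else "little"
        for cls in __v3_dtype_registry.values():
            if cls.to_numpy == np_dtype.newbyteorder("=") and cls.endianness in (endianness, None):
                return cls
        raise ValueError(f"numpy dtype {np_dtype} has no registered ZarrDType")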

@d-v-b (Contributor) commented Jan 23, 2025

@nenb thank you so much for working on this. I really like the ideas in here -- it looks like a great addition to the library. I will try to give this a close examination, and I recommend the other @zarr-developers/python-core-devs do the same. This is a big, forward-looking change, so I imagine we will have a lot of opinions on multiple scales here.

Our weekly dev meeting is tomorrow (here's the calendar info; the particular meeting I'm referring to is the zarr-python developers meeting). If the timing works for you, would it be possible for you to attend and give an overview of this PR?


class ZarrDType(metaclass=FrozenClassVariables):

zarr_spec_format: Literal["2", "3"] # the version of the zarr spec used
Member:

Suggested change:
- zarr_spec_format: Literal["2", "3"] # the version of the zarr spec used
+ zarr_format: Literal["2", "3"] # the version of the zarr spec used

dict | None
) # TODO: understand better how this is recommended by the spec

_zarr_spec_identifier: str # implementation detail used to map to core spec
Member:

I guess this would be mandatory? Or, how would the identifier for the metadata be specified otherwise?

Author (Jan 23, 2025):

TL;DR
This isn't an essential detail, and the implementation can probably be considerably improved if there is consensus around adding a new dtype interface. I've included some details below about how I ended up with this attribute if you are interested though.

Details
At the moment, this is generated from the class name (see the logic in __init_subclass__). The user doesn't specify it; rather, the user specifies the class name, and the identifier is generated from the class name (potentially prefixed with big_).

Example

class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "big"
    byte_count = 2
    to_numpy = np.dtype('float16')

This would generate big_float16 for _zarr_spec_identifier.


It's probably a bit clumsy in its current implementation. I ended up with this pattern i) as a way of tracking the in-memory endianness of the dtype and ii) to make sure the class name stays consistent with the identifier - the class name pops up a few times in the entrypoint registration logic, and I didn't want a situation where the class name could have a different value from the spec identifier.

Obviously, the convention is that a class needs to be named according to how the dtype is identified in the spec.

Member:

As I wrote in my other comment, we generally want to follow some guidelines around how and what metadata goes into the zarr.json files to achieve interoperability with other libraries/languages that implement Zarr. Unfortunately, some of these guidelines still need to be specced out.

In any case, maybe I missed it, why do we need to persist the endianness as part of the dtype? Shouldn't that be handled by the appropriate array-to-bytes codec (e.g. bytes)? The runtime endianness might need to be handled through runtime configuration, see the ArrayConfig class.

Author:

The confusion is my fault - I should probably have made it more clear what metadata I was proposing to serialise to disk.

I am proposing to serialise metadata to disk exactly as the current V3 spec outlines, i.e. only the dtype identifier and the configuration object.

(But see our discussion above on configuration, I am still learning how it is intended to be used.)

All other attributes here (endianness, experimental, etc.) are implementation details to express intent to zarr-python at runtime. This is already done via the numpy dtype API in zarr-python, e.g. statements like np.dtype.kind in ["U", "S"]. But this numpy API has limitations, e.g. it can't recognise new dtypes like bfloat16 correctly, which is a large part of why I am proposing that zarr-python have its own dtype interface in this PoC.

In any case, maybe I missed it, why do we need to persist the endianness as part of the dtype?

This was a suggestion on my part, and may not actually turn out to be helpful. A lot of zarr code does things like dtype='>i2' - a good example from the current issues. There will need to be a way of tracking this runtime endianness in Zarr V3.

As you pointed out, it seems likely that this could be handled through a runtime configuration (ArrayConfig). But it felt more natural (to me) to keep track of this information on the dtype itself.

It might be the case that I need to flesh out a complete implementation to try and see what both options look like. But I think it seems likely that there will need to be some way to keep track of the runtime endianness.

And to be clear, in this dtype implementation, I'm not proposing to serialise the information in the endianness attribute to disk in the zarr metadata. It would purely be an implementation detail that zarr-python would use to keep track of runtime endianness.
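
For instance (using the lookup from the PR description's examples; the attribute values shown are what this proposal would produce, as an illustration rather than a spec):

    import numpy as np
    from zarr.registry import get_v3dtype_class_from_numpy

    # User code that pins a byte order, as in dtype='>i2':
    zdtype = get_v3dtype_class_from_numpy(np.dtype('>i2'))

    zdtype.endianness             # "big"       - tracked at runtime only
    zdtype._zarr_spec_identifier  # "big_int16" - in-memory identifier
    # The on-disk metadata would record plain "int16"; the byte order itself
    # is handled by the array-to-bytes codec, per the V3 spec.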

@nenb (Author) commented Jan 23, 2025

@nenb thank you so much for working on this. I really like the ideas in here -- it looks like a great addition to the library. I will try to give this a close examination, and I recommend the other @zarr-developers/python-core-devs do the same. This is a big, forward-looking change, so I imagine we will have a lot of opinions on multiple scales here.

Our weekly dev meeting is tomorrow (here's the calendar info; the particular meeting I'm referring to is the zarr-python developers meeting). If the timing works for you, would it be possible for you to attend and give an overview of this PR?

@d-v-b Thanks for being so receptive! I'm happy to attend and give an overview.

What's the best format for presenting at the meeting? (I haven't attended before.) If I give a 5-minute overview (including some motivation from the AI side of things) and then leave 10 minutes for questions, would that work? Is there anything specific that you think is worth focusing on, e.g. the extension mechanism?

@d-v-b (Contributor) commented Jan 23, 2025

@nenb that plan sounds good. I'm looking forward to it!
