Prototype of new DType interface #2750

Status: Draft (wants to merge 1 commit into base: main)
7 changes: 7 additions & 0 deletions src/zarr/core/dtype/__init__.py
```python
from zarr.core.dtype.core import (
    ZarrDType
)

__all__ = [
    "ZarrDType"
]
```
204 changes: 204 additions & 0 deletions src/zarr/core/dtype/core.py
"""
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly the same information as what is included in the PR conversation.

# Overview

This module provides a proof-of-concept standalone interface for managing dtypes in the zarr-python codebase.

The `ZarrDType` class introduced in this module effectively acts as a replacement for `np.dtype` throughout the
zarr-python codebase. It attempts to encapsulate all runtime information necessary for working with dtypes in
the context of the Zarr V3 specification (e.g. whether the dtype is in the core spec, its byte count, its
endianness, and so on). By providing this abstraction, the module aims to:

- Simplify dtype management within zarr-python
- Support runtime flexibility and custom extensions
- Remove unnecessary dependencies on the numpy API

## Extensibility

The module attempts to support user-driven extensions, allowing developers to introduce custom dtypes
without requiring immediate changes to zarr-python. Extensions can leverage the current entrypoint mechanism,
enabling integration of experimental features. Over time, widely adopted extensions may be formalized through
inclusion in zarr-python or standardized via a Zarr Enhancement Proposal (ZEP), but this is not essential.

## Examples

### Core `dtype` Registration

The following example demonstrates how to register a built-in `dtype` in the core codebase:

```python
import numpy as np

from zarr.core.dtype import ZarrDType
from zarr.registry import register_v3dtype

class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype('float16')

register_v3dtype(Float16)
```

### Entrypoint Extension

The following example demonstrates how users can register a new `bfloat16` dtype for Zarr.
This approach adheres to the existing Zarr entrypoint pattern as much as possible, ensuring
consistency with other extensions. The code below would typically be part of a Python package
that specifies the entrypoints for the extension:

```python
import ml_dtypes  # noqa: F401 -- importing this makes np.dtype('bfloat16') available
import numpy as np

from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

class Bfloat16(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype('bfloat16')  # enabled by importing ml_dtypes
    configuration_v3 = {
        "version": "example_value",
        "author": "example_value",
        "ml_dtypes_version": "example_value"
    }
```

### dtype lookup

The following examples demonstrate how to look up the relevant `ZarrDType`, given either a string matching the
dtype's Zarr specification ID or a numpy dtype object:

```python
from zarr.registry import get_v3dtype_class, get_v3dtype_class_from_numpy

get_v3dtype_class('complex64') # returns little-endian Complex64 ZarrDType
get_v3dtype_class('not_registered_dtype') # ValueError

get_v3dtype_class_from_numpy('>i2') # returns big-endian Int16 ZarrDType
get_v3dtype_class_from_numpy(np.dtype('float32')) # returns little-endian Float32 ZarrDType
get_v3dtype_class_from_numpy('i10') # ValueError
```
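The registry lookups above can be pictured with a minimal, hypothetical sketch; zarr-python's real registry
differs, and `_registry`, `register`, and `get_dtype_class` are illustrative names that only show spec-ID to
class resolution:

```python
# Hypothetical miniature of the lookup behaviour shown above.
_registry: dict[str, type] = {}

def register(cls: type) -> type:
    # Index the class under its lowercased name, mirroring the spec-ID convention.
    _registry[cls.__name__.lower()] = cls
    return cls

def get_dtype_class(spec_id: str) -> type:
    try:
        return _registry[spec_id]
    except KeyError:
        raise ValueError(f"No Zarr dtype registered for {spec_id!r}") from None

@register
class Complex64:
    pass

print(get_dtype_class("complex64").__name__)  # Complex64
```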

### String dtypes

The following indicates one possibility for supporting variable-length strings. It uses the entrypoint
mechanism, as in a previous example. The Apache Arrow specification does not currently include a dtype for
fixed-length strings (only for fixed-length bytes), so string here implicitly refers to variable-length string
data (there may be some subtleties with codecs that mean this needs to be refined further):

```python
import numpy as np

from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

try:
    to_numpy = np.dtypes.StringDType()
except AttributeError:
    to_numpy = np.dtypes.ObjectDType()

class String(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = 'little'
    byte_count = None  # None is defined to mean variable
    to_numpy = to_numpy
```

### int4 dtype

There is currently considerable interest in the AI community in 'quantising' models: storing models at reduced
precision while minimising loss of information content. There are a number of sub-byte dtypes that the community
is using, e.g. int4. Unfortunately, numpy does not currently offer an easy way to handle such sub-byte dtypes.
However, they can still be held in a numpy array and then passed (zero-copy) to something like pytorch, which
can handle them appropriately:

```python
import numpy as np

from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

class Int4(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = 'little'
    byte_count = 1  # this is ugly, but could change from byte_count to bit_count if there was consensus
    to_numpy = np.dtype('B')  # could also be np.dtype('V1'), but this would prevent bit-twiddling
    configuration_v3 = {
        "version": "example_value",
        "author": "example_value",
    }
```
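To make the bit-twiddling remark concrete, here is a small sketch (not part of the PR) of how two int4 values
packed into a single `np.dtype('B')` byte could be unpacked with plain numpy operations:

```python
import numpy as np

# Two 4-bit values packed into one uint8 element: low nibble 1, high nibble 3.
packed = np.array([0b00110001], dtype='B')
low = packed & 0x0F          # low nibble of each byte
high = (packed >> 4) & 0x0F  # high nibble of each byte
print(int(low[0]), int(high[0]))  # 1 3
```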
"""

```python
from __future__ import annotations

from typing import Any, Literal

import numpy as np
```
**Author comment:** Implementation detail: I decided to try to freeze the class attributes after a dtype has been created. I used metaclasses for this. It's not essential.

```python
# perhaps over-complicating, but I don't want to allow the attributes to be patched
class FrozenClassVariables(type):
    def __setattr__(cls, attr, value):
        if hasattr(cls, attr):
            raise ValueError(
                f"Attribute {attr} on ZarrDType class can not be changed once set."
            )
        super().__setattr__(attr, value)  # attribute not yet set: allow it
```

```python
class ZarrDType(metaclass=FrozenClassVariables):
```

**Author comment:** The most important thing here IMO is that `ZarrDType` should contain all attributes required when introspecting dtypes at runtime.

I would like to replace all statements like `np.dtype.kind in ["S", "U"]` or `np.dtype.itemsize > 0` in the codebase with statements like `if ZarrDType.byte_count > 0` etc. Basically, replacing the numpy dtype API with a new zarr-specific API.

I have included the attributes that I currently believe are necessary. But some may be unnecessary, and I may have forgotten others. It's a first attempt!

```python
    zarr_spec_format: Literal["2", "3"]  # the version of the zarr spec used
```

**Member suggested change:** rename the attribute:

```python
    zarr_format: Literal["2", "3"]  # the version of the zarr spec used
```
```python
    experimental: bool  # is this in the core spec or not
    endianness: Literal[
        "big", "little", None
    ]  # None indicates not defined i.e. single byte or byte strings
```

**Author comment:** Zarr V3 has made the decision to use a codec for endianness; endianness is not to be attached to the dtype. This creates some problems for the Zarr API, which is still linked to numpy's API in a number of ways, including the ability to create in-memory arrays of arbitrary endianness.

Currently, I think that the practical solution is for zarr-python to have dtypes that distinguish between big and little endianness in memory, but that when serialised to disk, always serialise the little-endian dtype.

I can elaborate on this with examples if helpful, but basically, endianness would just be an implementation detail for zarr-python that would allow it to track the endianness of an object in memory, and it wouldn't actually be used when serialising to disk.
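For reference, numpy already exposes the per-dtype byte order that this bookkeeping would track; a quick
illustration (not PR code):

```python
import numpy as np

# numpy dtypes carry a byteorder flag ('>', '<', '=', or '|'); the proposal
# would track equivalent information on ZarrDType itself. Note that numpy
# reports the machine-native order as '='.
big = np.dtype('>i2')
little = np.dtype('<i2')
print(big.byteorder, little.byteorder)
print(big == little)  # False: same itemsize, different byte order
```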
```python
    byte_count: int | None  # None indicates variable count
    to_numpy: np.dtype[
        Any
    ]  # may involve installing a numpy extension e.g. ml_dtypes
```

**Author comment:** See the bfloat16 example for how this might require new packages to be installed.

```python
    configuration_v3: (
        dict | None
    )  # TODO: understand better how this is recommended by the spec
```

**Author:** Wasn't clear to me how this is intended to be used in the spec...

**Member:** Basically, dtypes can be represented in the json metadata as a short-hand (`str`) or dict (`{"name": str, "configuration": None | {...}}`). The `configuration` key is optional and could be used for dtypes that need additional configuration. If there is no `configuration` key, the short-hand version is equivalent to the dict with just a `name` key.

**Member:** For example, `bfloat16` is equivalent to `{"name": "bfloat16"}`.

**Author:** Ah, interesting, thanks!

My current thinking is that every dtype that is not in the core spec should include a `configuration` key. I would like to introduce a convention where extension dtypes also provide metadata like 'author', 'extension_version', etc., to give the best chance of reproducibility/re-use in the future. At least, until an extension dtype becomes a core dtype.

Is the `configuration` key an appropriate location for such metadata?

**Member:** Assigning names for extensions, such as dtypes, is something that the Zarr core spec should define to coordinate between the different Zarr implementations. However, there are some gaps in the spec when it comes to providing clear guidance for doing that. In the Zarr steering council, we are currently evaluating different options that we will propose to the community shortly. Our goal is to achieve a naming mechanism that avoids naming conflicts. Our current favorite is to have 2 types of names:

- URI-based names, e.g. `https://nenb.github.io/bfloat16`, which can be freely used by anybody who reasonably controls the URI. The URI doesn't need to resolve to anything; it is just a name. However, it makes sense to have some useful information under the URI, e.g. a spec document.
- Raw names, e.g. `bfloat16`, which would be assigned through a centralized registry (e.g. a git repo) through a to-be-defined process. This will entail a bit more process than the URI-based names and will come with some expectations w.r.t. backwards compatibility.

> Is the configuration key an appropriate location for such metadata?

Not necessarily. I think this information would be better placed in specification documents of the dtypes.
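The short-hand/dict equivalence the reviewer describes can be sketched with a hypothetical helper (not part of
zarr-python; `normalize_dtype_metadata` is an illustrative name):

```python
# Dtype metadata is either a short-hand string or a dict with a "name" key and
# optional "configuration"; the short-hand is equivalent to {"name": ...}.
def normalize_dtype_metadata(meta):
    if isinstance(meta, str):
        return {"name": meta}
    if isinstance(meta, dict) and "name" in meta:
        return meta
    raise ValueError(f"invalid dtype metadata: {meta!r}")

print(normalize_dtype_metadata("bfloat16"))  # {'name': 'bfloat16'}
print(normalize_dtype_metadata({"name": "bfloat16", "configuration": {"author": "x"}}))
```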

```python
    _zarr_spec_identifier: str  # implementation detail used to map to core spec
```

**Member:** I guess this would be mandatory? Or, how would the identifier for the metadata be specified otherwise?

**Author (@nenb, Jan 23, 2025):** TL;DR: This isn't an essential detail, and the implementation can probably be considerably improved if there is consensus around adding a new dtype interface. I've included some details below about how I ended up with this attribute if you are interested though.

Details: At the moment, this is generated from the class name (see the logic in `__init_subclass__`). The user doesn't specify it; rather, the user specifies the class name and the identifier is generated from the class name (potentially prefixed with `big_`).

Example:

```python
class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "big"
    byte_count = 2
    to_numpy = np.dtype('float16')
```

This would generate `big_float16` for `_zarr_spec_identifier`.

It's probably a bit clumsy in its current implementation. I ended up with this pattern i) as a way of tracking the in-memory endianness of the dtype and ii) to make sure the class name stays consistent with the identifier; the class name pops up a few times in the entrypoint registration logic, and I didn't want a situation where a class name could have a different value to the spec identifier. Obviously, the convention is that a class needs to be named according to how the dtype is identified in the spec.

**Member:** As I wrote in my other comment, we generally want to follow some guidelines around how and what metadata goes into the `zarr.json` files to achieve interoperability with other libraries/languages that implement Zarr. Unfortunately, some of these guidelines still need to be specced out.

In any case, maybe I missed it: why do we need to persist the endianness as part of the dtype? Shouldn't that be handled by the appropriate array-to-bytes codec (e.g. `bytes`)? The runtime endianness might need to be handled through runtime configuration, see the `ArrayConfig` class.

**Author:** The confusion is my fault; I should have made it clearer what metadata I was proposing to serialise to disk. I am proposing to serialise metadata to disk exactly as the current V3 specs outline, i.e. only the dtype identifier and the configuration object. (But see our discussion above on configuration; I am still learning how it is intended to be used.)

All other attributes here (endianness, experimental, etc.) are implementation details to express intent to zarr-python at runtime. This is already done by the numpy dtype API in zarr-python, e.g. the use of statements like `np.dtype.kind in ["U", "S"]`. But this numpy API has limitations, e.g. it can't recognise new dtypes like bfloat16 correctly, which is a large reason why I am proposing that zarr-python have its own dtype interface in this PoC.

> In any case, maybe I missed it, why do we need to persist the endianness as part of the dtype?

This was a suggestion on my part, and may not actually turn out to be helpful. A lot of zarr code does things like `dtype='>i2'` (a good example from the current issues). There will need to be a way of tracking this runtime endianness in Zarr V3. As you pointed out, it seems likely that this could be handled through a runtime configuration (`ArrayConfig`), but it felt more natural (to me) to keep track of this information on the dtype itself. It might be the case that I need to flesh out a complete implementation to see what both options look like.

And to be clear, in this dtype implementation, I'm not proposing to serialise the information in the `endianness` attribute to disk in the zarr metadata. It would purely be an implementation detail that zarr-python uses to keep track of runtime endianness.

**Author comment:** Implementation detail: I thought it would be helpful to prevent class creation unless all attributes were defined.

```python
    def __init_subclass__(  # enforces all required fields are set and basic sanity checks
        cls,
        **kwargs,
    ) -> None:
        required_attrs = [
            "zarr_spec_format",
            "experimental",
            "endianness",
            "byte_count",
            "to_numpy",
        ]
        for attr in required_attrs:
            if not hasattr(cls, attr):
                raise ValueError(f"{attr} is a required attribute for a Zarr dtype.")

        if not hasattr(cls, "configuration_v3"):
            cls.configuration_v3 = None

        cls._zarr_spec_identifier = (
            "big_" + cls.__qualname__.lower()
            if cls.endianness == "big"
            else cls.__qualname__.lower()
        )  # how this dtype is identified in core spec; convention is prefix with big_ for big-endian

        cls._validate()  # sanity check on basic requirements

        super().__init_subclass__(**kwargs)

    # TODO: add further checks
    @classmethod
    def _validate(cls):
        if cls.byte_count is not None and cls.byte_count <= 0:
            raise ValueError("byte_count must be a positive integer.")

        if cls.byte_count == 1 and cls.endianness is not None:
            raise ValueError("Endianness must be None for single-byte types.")
```

**Author comment (on the `big_` prefix):** Again, the way I am proposing endianness is just as an implementation detail for zarr-python to track the endianness of in-memory objects. When serialised to disk, this `big_` prefix would always be removed.

**Author comment (on `_validate`):** Just an example of the sort of validation that could happen.