Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dataset Attribute type for Pytrees #5732

Merged
merged 49 commits into from
Jun 21, 2024
Merged

Add Dataset Attribute type for Pytrees #5732

merged 49 commits into from
Jun 21, 2024

Conversation

brownj85
Copy link
Contributor

@brownj85 brownj85 commented May 23, 2024

Context:
PyTrees are an ideal format for dataset attributes, since they provide a compact representation of deeply nested structures, and Pennylane Operators can easily be converted to and from pytrees.

Shortcut story: https://app.shortcut.com/xanaduai/story/63174/datasets-serialization-using-pytrees

Description of the Change:

  • Adds a pytree.serialization module for converting PyTreeStructure objects to and from a JSON representation
  • Adds a DatasetPyTree data attribute type that can store any pytree-compatible type
  • DatasetPyTree is now the default attribute type for Operator
  • The pytree.unflatten method disables recording, to prevent deserialized operators from being queued

Benefits:

  • Datasets support any Pennylane object with a Pytree representation and serializable metadata
  • Complex serialization logic in DatasetOperator can be deprecated
  • Reduces file size of datasets

Possible Drawbacks:

  • Pytree datasets will not be compatible with versions < 0.37
  • Type information for metadata is not preserved - tuple, Wires, and Shots objects will all be converted
    to list on a round-trip. __init__ methods are responsible for ensuring that metadata is converted to the correct type

Base automatically changed from pytree-flatten-unflatten to master May 24, 2024 14:28
@brownj85 brownj85 changed the title Datasets pytrees Add Dataset Attribute type for Pytrees May 24, 2024
@brownj85 brownj85 requested a review from mudit2812 May 28, 2024 14:19
Copy link
Contributor

@mudit2812 mudit2812 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @brownj85 . Very excited to get this in!

Copy link
Contributor

@albi3ro albi3ro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any security concerns about deserializing json?

@brownj85
Copy link
Contributor Author

Any security concerns about deserializing json?

Not that I can think of - a very large JSON string could cause the interpreter to consume a lot of cpu and memory, but that's not a huge concern since we're not running this on a server.

@brownj85 brownj85 requested a review from albi3ro May 30, 2024 14:36
@DSGuala DSGuala self-requested a review June 4, 2024 19:10
Comment on lines -15 to +17
from typing import NamedTuple, Sequence, Tuple
from typing import NamedTuple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chances of this causing issues with doing shot-based measurements? The tests seem to be passing, so maybe no problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to be 100% sure but it shouldn't cause a problem - Sequence is less restrictive than tuple so anything that worked before will still work


d.op = qml.PauliX(0)

assert isinstance(d.attrs["op"], DatasetPyTree)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this pass while d.op fails? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accessing d.op returns the deserialized object, e.g PauliX in this case. attrs["op"] returns the underlying dataset attribute type that's used for de-serialization

@DSGuala DSGuala added this to the v0.37 milestone Jun 17, 2024
doc/releases/changelog-dev.md Outdated Show resolved Hide resolved
pennylane/data/attributes/pytree.py Outdated Show resolved Hide resolved
# but will fail if the leaves are not homogenous
DatasetArray(leaves, parent_and_key=(bind, "leaves"))
except (ValueError, TypeError):
DatasetList(leaves, parent_and_key=(bind, "leaves"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching exceptions might be expensive for large data. Is there a way to rather rely on if conditions here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to delegate to numpy here - otherwise we'd need to implement a check that leaves is homogenous and array-compatible, which would likely just be duplicating numpy's logic. This would also be slower in the ideal case that leaves is homogenous.

I had the same concern about performance, but datasets are read a lot more than they're written, and DatasetArray is a lot more compact and performant than DatasetList. So the tradeoff makes sense IMO


if has_jax:
_register_pytree_with_jax(pytree_type, flatten_fn, unflatten_fn)


def is_pytree(type_: type[Any]) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check for some kind of internal structure here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the jax definition of a pytree I don't think so. Any container-like (list, dict) or registered object is a Pytree, even if they contain objects ("leaves") that aren't.

Co-authored-by: Utkarsh <utkarshazad98@gmail.com>
@DSGuala
Copy link
Contributor

DSGuala commented Jun 20, 2024

Looks good to me 👍
I don't think any of the pending questions/comments are blocking. We can merge for the release.

@DSGuala DSGuala enabled auto-merge (squash) June 21, 2024 13:28
@DSGuala DSGuala merged commit 21b40c5 into master Jun 21, 2024
40 checks passed
@DSGuala DSGuala deleted the datasets-pytrees branch June 21, 2024 14:11
mudit2812 added a commit that referenced this pull request Jul 2, 2024
**Context:**
PyTrees are an ideal format for dataset attributes, since they provide a
compact representation of deeply nested structures, and Pennylane
Operators can easily be converted to and from pytrees.

Shortcut story:
https://app.shortcut.com/xanaduai/story/63174/datasets-serialization-using-pytrees

**Description of the Change:**
- Adds a `pytree.serialization` module for converting `PyTreeStructure`
objects to and from a JSON representation
- Adds a `DatasetPyTree` data attribute type that can store any
pytree-compatible type
- `DatasetPyTree` is now the default attribute type for `Operator`
- The `pytree.unflatten` method disables recording, to prevent
deserialized operators from being queued

**Benefits:**
- Datasets support any Pennylane object with a Pytree representation and
serializable metadata
- Complex serialization logic in ``DatasetOperator`` can be deprecated
- Reduces file size of datasets

**Possible Drawbacks:**
- Pytree datasets will not be compatible with versions < 0.37
- Type information for metadata is not preserved - `tuple`, `Wires`, and
`Shots` objects will all be converted
to `list` on a round-trip. `__init__` methods are responsible for
ensuring that metadata is converted to the correct type

---------

Co-authored-by: albi3ro <chrissie.c.l@gmail.com>
Co-authored-by: Christina Lee <christina@xanadu.ai>
Co-authored-by: Thomas R. Bromley <49409390+trbromley@users.noreply.github.com>
Co-authored-by: Mudit Pandey <mudit.pandey@xanadu.ai>
Co-authored-by: Paul Finlay <50180049+doctorperceptron@users.noreply.github.com>
Co-authored-by: Diego <67476785+DSGuala@users.noreply.github.com>
Co-authored-by: Utkarsh <utkarshazad98@gmail.com>
Co-authored-by: Diego <diego_guala@hotmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants