
Add table statistics #1285

Open

wants to merge 3 commits into main from add-statistics
Conversation

ndrluis (Collaborator) commented on Nov 4, 2024

The Java expire-snapshots process expires table statistics and partition statistics. I am implementing table statistics support to make our expire-snapshots procedure compatible with the Java implementation.

@ndrluis changed the title from "Add table statistics update" to "Add table statistics" on Nov 4, 2024
ndrluis (Collaborator, author) commented on Nov 4, 2024

I plan to move the set/remove statistics methods from the Transaction class to another class, such as ManageSnapshot. In the meantime, I’d like to confirm with everyone if I’m heading in the right direction with the current implementation.

@Fokko @sungwy @kevinjqliu

@ndrluis changed the title from "Add table statistics" to "WIP: Add table statistics" on Nov 4, 2024
kevinjqliu (Contributor) left a comment

Thanks for the PR! I added a few comments. I think it would also be helpful to include integration tests.

pyiceberg/table/metadata.py
statistics_path: str = Field(alias="statistics-path")
file_size_in_bytes: int = Field(alias="file-size-in-bytes")
file_footer_size_in_bytes: int = Field(alias="file-footer-size-in-bytes")
blob_metadata: List[BlobMetadata] = Field(alias="blob-metadata")
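The snake_case attributes above are populated from kebab-case keys in the table-metadata JSON via Pydantic field aliases. A dependency-free sketch of that same mapping (the `BlobMetadata` shape here is abbreviated and illustrative, not the full spec):

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class BlobMetadata:
    # Abbreviated: the real blob metadata carries more fields
    # (sequence-number, fields, properties, ...).
    type: str
    snapshot_id: int


@dataclass
class StatisticsFile:
    snapshot_id: int
    statistics_path: str
    file_size_in_bytes: int
    file_footer_size_in_bytes: int
    blob_metadata: List[BlobMetadata]


def parse_statistics(entry: Dict[str, Any]) -> StatisticsFile:
    """Map the kebab-case JSON keys onto snake_case attributes."""
    return StatisticsFile(
        snapshot_id=entry["snapshot-id"],
        statistics_path=entry["statistics-path"],
        file_size_in_bytes=entry["file-size-in-bytes"],
        file_footer_size_in_bytes=entry["file-footer-size-in-bytes"],
        blob_metadata=[
            BlobMetadata(type=b["type"], snapshot_id=b["snapshot-id"])
            for b in entry["blob-metadata"]
        ],
    )
```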

The alter table builder.
"""
updates = (
RemoveStatisticsUpdate(
kevinjqliu (Contributor) commented:

Do you mind linking the Java implementation? Do we want to remove all stats?

ndrluis (Collaborator, author) replied:

I understand that we want to remove the statistics of a specific snapshot, and that there is one statistics file per snapshot.

The equivalent code would be the SetStatistics class, which follows the same pattern as our ManageSnapshot class. This is the scenario I want to double-check to ensure we follow the same pattern.
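The pattern in question can be sketched as a small fluent builder that queues updates and only applies them on commit. This is a hypothetical, minimal illustration of the builder style, not the actual pyiceberg classes:

```python
from typing import Any, List, Tuple


class UpdateStatistics:
    """Hypothetical fluent builder mirroring the ManageSnapshot pattern:
    operations are queued and only applied when commit() is called."""

    def __init__(self) -> None:
        self._updates: List[Tuple[str, Any]] = []

    def set_statistics(self, snapshot_id: int, statistics_file: Any) -> "UpdateStatistics":
        self._updates.append(("set-statistics", (snapshot_id, statistics_file)))
        return self  # returning self enables method chaining

    def remove_statistics(self, snapshot_id: int) -> "UpdateStatistics":
        self._updates.append(("remove-statistics", snapshot_id))
        return self

    def commit(self) -> List[Tuple[str, Any]]:
        # The real implementation would turn the queued operations into
        # table-metadata updates; here we simply return them.
        return self._updates
```

With this shape, `UpdateStatistics().set_statistics(1, f).remove_statistics(2).commit()` queues both operations and applies them together at commit time.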

if update.snapshot_id != update.statistics.snapshot_id:
    raise ValueError("Snapshot id in statistics does not match the snapshot id in the update")

rest_statistics = [stat for stat in base_metadata.statistics if stat.snapshot_id != update.snapshot_id]
kevinjqliu (Contributor) commented:

nit: this can be a helper function to filter on snapshot_id
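A sketch of such a helper, using the name suggested later in this review; `StatisticsFile` is reduced here to the one attribute the filter depends on:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StatisticsFile:
    # Reduced to the single attribute the filter needs.
    snapshot_id: int


def filter_statistics_by_snapshot_id(
    statistics: List[StatisticsFile], reject_snapshot_id: int
) -> List[StatisticsFile]:
    """Return all statistics entries except those for the given snapshot."""
    return [stat for stat in statistics if stat.snapshot_id != reject_snapshot_id]
```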

},
{
"snapshot-id": 3055729675574597004,
"statistics-path": "s3://a/b/stats.puffin",
kevinjqliu (Contributor) commented:

Does this file need to exist on disk?

ndrluis (Collaborator, author) replied:

No, there is no validation in place. This is only used for clients that support puffin files and for the expire snapshot procedure, which removes this information from the metadata. If the user wants, they can also remove the file as part of the expire snapshot procedure.

@ndrluis force-pushed the add-statistics branch 2 times, most recently from 9b15c86 to d16ef47, on November 10, 2024
@ndrluis changed the title from "WIP: Add table statistics" to "Add table statistics" on Nov 10, 2024
@ndrluis ndrluis marked this pull request as ready for review November 10, 2024 23:43
ndrluis (Collaborator, author) commented on Nov 10, 2024

@kevinjqliu could you please review it once more?

kevinjqliu (Contributor) left a comment

I added a few comments.

Do you know which engines can currently generate puffin files? It would be great to add an integration test with a Spark-generated puffin file.

table.update_statistics()
.set_statistics(snapshot_id1, statistics_file1)
.remove_statistics(snapshot_id2)
# Operations are applied on commit.
kevinjqliu (Contributor) commented:

nit: add .commit() instead of the comment

kevinjqliu (Contributor) commented:

or use snapshot_id=1

ndrluis (Collaborator, author) replied:

I added the commit() and kept the comment consistent with the other examples.

Comment on lines 1150 to 1151
update.set_statistics(1, statistics_file)
update.remove_statistics(2)
kevinjqliu (Contributor) commented:

nit: replace 1/2 with snapshot_id1/snapshot_id2 to show the input relation

blob_metadata: List[BlobMetadata] = Field(alias="blob-metadata")


def reject_statistics(
kevinjqliu (Contributor) commented:

nit: how about filter_statistics_by_snapshot_id?

ndrluis (Collaborator, author) commented on Nov 12, 2024

> Do you know which engines can currently generate puffin files? It would be great to add an integration test with a Spark-generated puffin file.

@kevinjqliu As far as I know, only Trino can generate them. What kind of test would you like to have? I believe we are covering all relevant cases for this PR.

If PyIceberg could generate or read puffin files, then I agree it would be useful to add tests to check compatibility between engines. However, I think it only makes sense to test puffin files during reading, as testing generation would mean verifying the implementation of something that isn't our responsibility. In this case, it's just a metadata update.

What do you think?

kevinjqliu (Contributor) left a comment

LGTM! Thanks for working on this!

Regarding the integration tests: since we're manipulating table metadata to add/remove table stats, it would be great to verify that another source can interact with these stats. Not a hard blocker.
