Add table statistics #1285
I plan to move the set/remove statistics methods from the Transaction class to another class, such as ManageSnapshot. In the meantime, I’d like to confirm with everyone if I’m heading in the right direction with the current implementation.
Thanks for the PR! Added a few comments. I think it would also be helpful to include integration tests
statistics_path: str = Field(alias="statistics-path")
file_size_in_bytes: int = Field(alias="file-size-in-bytes")
file_footer_size_in_bytes: int = Field(alias="file-footer-size-in-bytes")
blob_metadata: List[BlobMetadata] = Field(alias="blob-metadata")
nit, missing key_metadata
https://iceberg.apache.org/spec/#table-statistics
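To make the nit concrete, here is a minimal sketch of the statistics file record with the optional `key-metadata` field included. It uses plain dataclasses as a stand-in rather than the Pydantic model in this PR, and `parse_statistics_file` and `_snake` are hypothetical helpers for illustration only; the field list follows my reading of the linked spec section.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class BlobMetadata:
    type: str
    snapshot_id: int
    sequence_number: int
    fields: List[int]
    properties: Optional[Dict[str, str]] = None  # optional in the spec


@dataclass(frozen=True)
class StatisticsFile:
    snapshot_id: int
    statistics_path: str
    file_size_in_bytes: int
    file_footer_size_in_bytes: int
    blob_metadata: List[BlobMetadata]
    key_metadata: Optional[str] = None  # the field the review comment asks for


def _snake(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map hyphenated JSON keys to Python attribute names (hypothetical helper)."""
    return {key.replace("-", "_"): value for key, value in raw.items()}


def parse_statistics_file(raw: Dict[str, Any]) -> StatisticsFile:
    """Build a StatisticsFile from a metadata JSON object (hypothetical helper)."""
    fields = _snake(raw)
    fields["blob_metadata"] = [BlobMetadata(**_snake(blob)) for blob in fields["blob_metadata"]]
    return StatisticsFile(**fields)
```

Because `key_metadata` defaults to `None`, metadata written without the field still parses.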
pyiceberg/table/__init__.py
Outdated
The alter table builder.
"""
updates = (
    RemoveStatisticsUpdate(
Do you mind linking the Java implementation? Do we want to remove all stats?
I understand that we want to remove the statistics of a specific snapshot, and I understand that we have one statistics file per snapshot.
The equivalent code would be the SetStatistics class, which follows the same pattern as our ManageSnapshot class. This is the scenario I want to double-check to ensure we follow the same pattern.
pyiceberg/table/update/__init__.py
Outdated
if update.snapshot_id != update.statistics.snapshot_id:
    raise ValueError("Snapshot id in statistics does not match the snapshot id in the update")

rest_statistics = [stat for stat in base_metadata.statistics if stat.snapshot_id != update.snapshot_id]
nit: this can be a helper function to filter on snapshot_id
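A rough sketch of what such a helper could look like, and how both the set and remove paths might share it. The trimmed-down `StatisticsFile` dataclass and the `set_statistics`/`remove_statistics` functions operating on plain lists are assumptions for illustration, not the code in this PR; the helper name is borrowed from a later suggestion in this review.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StatisticsFile:
    # Trimmed-down stand-in; the real model carries more fields.
    snapshot_id: int
    statistics_path: str


def filter_statistics_by_snapshot_id(
    statistics: List[StatisticsFile], reject_snapshot_id: int
) -> List[StatisticsFile]:
    """Return every statistics file that does NOT belong to the given snapshot."""
    return [stat for stat in statistics if stat.snapshot_id != reject_snapshot_id]


def set_statistics(
    statistics: List[StatisticsFile], snapshot_id: int, new: StatisticsFile
) -> List[StatisticsFile]:
    """Replace any existing statistics for the snapshot with the new file."""
    if snapshot_id != new.snapshot_id:
        raise ValueError("Snapshot id in statistics does not match the snapshot id in the update")
    return filter_statistics_by_snapshot_id(statistics, snapshot_id) + [new]


def remove_statistics(statistics: List[StatisticsFile], snapshot_id: int) -> List[StatisticsFile]:
    """Drop the statistics entry for one snapshot, leaving the others untouched."""
    return filter_statistics_by_snapshot_id(statistics, snapshot_id)
```

Keeping the filter in one place means set (replace) and remove cannot drift apart in how they match a snapshot.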
},
{
    "snapshot-id": 3055729675574597004,
    "statistics-path": "s3://a/b/stats.puffin",
Does this file need to exist on disk?
No, there is no validation in place. This is only used for clients that support puffin files and for the expire snapshot procedure, which removes this information from the metadata. If the user wants, they can also remove the file as part of the expire snapshot procedure.
@kevinjqliu could you please review it once more?
Added a few comments.
Do you know which engines can currently generate Puffin files? It would be great to add an integration test with a Spark-generated Puffin file.
table.update_statistics()
    .set_statistics(snapshot_id1, statistics_file1)
    .remove_statistics(snapshot_id2)
# Operations are applied on commit.
nit: add .commit() instead of the comment
or use snapshot_id=1
I added the commit() and kept the comment consistent with the other examples.
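For readers following the thread, the "applied on commit" behaviour can be sketched as a small builder that queues operations and only touches the statistics list when `commit()` is called. This is a toy model under stated assumptions: the `UpdateStatistics` class below, its constructor, and the use of plain dicts for statistics files are hypothetical and are not the PyIceberg implementation.

```python
from typing import Any, Dict, List, Tuple

StatisticsDict = Dict[str, Any]


class UpdateStatistics:
    """Toy builder: queue set/remove operations, apply them in order on commit()."""

    def __init__(self, current: List[StatisticsDict]) -> None:
        self._current = list(current)
        self._pending: List[Tuple[str, int, Any]] = []

    def set_statistics(self, snapshot_id: int, statistics_file: StatisticsDict) -> "UpdateStatistics":
        self._pending.append(("set", snapshot_id, statistics_file))
        return self  # returning self is what allows chaining

    def remove_statistics(self, snapshot_id: int) -> "UpdateStatistics":
        self._pending.append(("remove", snapshot_id, None))
        return self

    def commit(self) -> List[StatisticsDict]:
        result = list(self._current)
        for action, snapshot_id, statistics_file in self._pending:
            # Both actions first drop any existing entry for the snapshot.
            result = [stat for stat in result if stat["snapshot-id"] != snapshot_id]
            if action == "set":
                result.append(statistics_file)
        self._pending.clear()
        return result


snapshot_id1, snapshot_id2 = 1, 2
statistics_file1 = {"snapshot-id": snapshot_id1, "statistics-path": "s3://a/b/stats.puffin"}
existing = [{"snapshot-id": snapshot_id2, "statistics-path": "s3://a/b/old.puffin"}]

updated = (
    UpdateStatistics(existing)
    .set_statistics(snapshot_id1, statistics_file1)
    .remove_statistics(snapshot_id2)
    .commit()
)
```

Nothing changes until `commit()` runs, which is why the example in the docs needs the explicit call.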
mkdocs/docs/api.md
Outdated
update.set_statistics(1, statistics_file)
update.remove_statistics(2)
nit: replace 1/2 with snapshot_id1/snapshot_id2 to show the input relation
pyiceberg/table/statistics.py
Outdated
blob_metadata: List[BlobMetadata] = Field(alias="blob-metadata")


def reject_statistics(
nit: how about filter_statistics_by_snapshot_id?
@kevinjqliu As far as I know, only Trino can generate them. What kind of test would you like to have? I believe we are covering all relevant cases for this PR. If PyIceberg could generate or read puffin files, then I agree it would be useful to add tests to check compatibility between engines. However, I think it only makes sense to test puffin files during reading, as testing generation would mean verifying the implementation of something that isn’t our responsibility. In this case, it’s just a metadata update. What do you think?
LGTM! Thanks for working on this!
Regarding the integration tests, since we're manipulating table metadata to add/remove table stats, it would be great to verify that another source can interact with these stats. Not a hard blocker
The Java expire snapshot process expires table statistics and partition statistics. I am implementing a statistics table to make our expire snapshot compatible with the Java implementation.