[Iceberg] Add Histogram Statistic Support #22365

Merged

Conversation

Contributor

@ZacBlanco ZacBlanco commented Mar 29, 2024

Description

This change adds support for the histogram statistic type to Iceberg. There are 3 related commits:

  1. Adds support for collecting a superset of supported statistics in the HMS and Iceberg. This is required to allow histograms to be supported while the HMS is in use.
  2. Adds the histogram implementation for Iceberg.
  3. Adds documentation for histogram statistics.

Motivation and Context

Histograms improve the accuracy of output row count estimates for filter nodes when calculating statistics. More accurate estimates in the cost-based optimizer (CBO) should lead to better query plans.

Impact

  1. ANALYZE for Iceberg and Hive will now store statistics in the table's Puffin files.
  2. Writes a new Puffin blob type containing the histogram.

Test Plan

  • New tests to ensure histograms are collected for supported types
  • Tests that ensure the histograms result in more accurate plan stat estimates for non-uniform data
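
As a rough illustration of why the second kind of test matters, the sketch below compares a min/max-only (uniform) estimate with a simple equi-depth histogram estimate for the filter `value <= threshold` on skewed data. It is a conceptual Python sketch, not the Presto or Iceberg implementation: the function names, the bucket count, and the sample data are all assumptions made for illustration, and the PR itself builds histograms with `sketch_kll` rather than by sorting the column.

```python
import bisect


def uniform_estimate(values, threshold):
    """Fraction of rows <= threshold, assuming values are spread evenly over [min, max]."""
    low, high = min(values), max(values)
    if threshold >= high:
        return 1.0
    if threshold < low:
        return 0.0
    return (threshold - low) / (high - low)


def equi_depth_boundaries(values, buckets):
    """Upper boundary of each bucket, where every bucket holds about the same number of rows."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[max(0, (i * n) // buckets - 1)] for i in range(1, buckets + 1)]


def histogram_estimate(boundaries, threshold):
    """Fraction of rows <= threshold, counting only buckets that lie entirely at or below it."""
    full_buckets = bisect.bisect_right(boundaries, threshold)
    return full_buckets / len(boundaries)


# Hypothetical skewed column: 900 rows in 0..9 and 100 rows spread from 10 up to 1000.
values = [i % 10 for i in range(900)] + [10 + 10 * i for i in range(100)]
threshold = 9

actual = sum(1 for v in values if v <= threshold) / len(values)
uniform = uniform_estimate(values, threshold)
hist = histogram_estimate(equi_depth_boundaries(values, 10), threshold)

print(f"actual={actual:.3f} uniform={uniform:.3f} histogram={hist:.3f}")
```

On this data the uniform assumption badly underestimates the selectivity of the filter, while the histogram estimate tracks the actual fraction. A coarser histogram or a threshold that falls inside a bucket would be less exact, since this sketch does not interpolate within buckets.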

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add support for the histogram statistic type. :pr:`22365`

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from 58cfe30 to e1a2caa Compare April 17, 2024 13:37
@steveburnett
Contributor

Consider this suggested change for the release note entry:

== RELEASE NOTES ==

Iceberg Connector Changes
* Add support for the histogram statistic type. :pr:`22365`

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch 6 times, most recently from bac6b3f to 0e3a87c Compare April 23, 2024 15:26
Contributor

@steveburnett steveburnett left a comment

Thanks for the docs! A few nits suggested for clarity and style. Let me know what you think.

presto-docs/src/main/sphinx/admin/properties.rst (outdated, resolved)
presto-docs/src/main/sphinx/admin/properties.rst (outdated, resolved)
presto-docs/src/main/sphinx/admin/properties.rst (outdated, resolved)
presto-docs/src/main/sphinx/optimizer/statistics.rst (outdated, resolved)
@ZacBlanco ZacBlanco marked this pull request as ready for review April 23, 2024 17:13
@ZacBlanco ZacBlanco requested review from a team, hantangwangd and jaystarshot as code owners April 23, 2024 17:13
@ZacBlanco ZacBlanco requested a review from presto-oss April 23, 2024 17:13
github-actions bot commented Apr 23, 2024

Codenotify: Notifying subscribers in CODENOTIFY files for diff 901da6d...758f891.

No notifications.

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch 2 times, most recently from 3d551e9 to 65b6fe8 Compare April 26, 2024 03:02
steveburnett previously approved these changes Apr 26, 2024
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, new local docs build, everything looks good. Thanks!

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from 65b6fe8 to 0d6389a Compare June 4, 2024 17:46
@ZacBlanco ZacBlanco requested a review from feilong-liu as a code owner June 4, 2024 17:46
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from 0d6389a to 24398a5 Compare June 4, 2024 21:55
Member

@hantangwangd hantangwangd left a comment

Took a quick look and have some questions for discussion. I will take a more detailed look in the next day or two.

presto-iceberg/pom.xml (outdated, resolved)
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from 24398a5 to 56b79b0 Compare June 20, 2024 17:51
Contributor

@steveburnett steveburnett left a comment

Thanks for the updated doc! Two nits about formatting, no issues with the content in any of the three doc files.

presto-docs/src/main/sphinx/admin/properties.rst (outdated, resolved)
presto-docs/src/main/sphinx/optimizer/statistics.rst (outdated, resolved)
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from 56b79b0 to 492ab5f Compare June 20, 2024 19:31
steveburnett previously approved these changes Jun 20, 2024
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Thanks for the quick fix! Pull updated branch, new local doc build, looks good.

Member

@hantangwangd hantangwangd left a comment

Thanks for the fix. The changes look good to me, only some nits.

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from 56b02fd to d041947 Compare September 23, 2024 17:48
Contributor

@aaneja aaneja left a comment

Did a short first pass


The set of statistics available for a particular query depends on the connector
being used and can also vary by table or even by table layout. For example, the
Hive connector does not currently provide statistics on data size.

-Table statistics can be displayed via the Presto SQL interface using the
+Table statistics can be displayed using a SQL statement with the
Contributor

nit: "can be fetched using the below SQL query" sounds better to me

presto-docs/src/main/sphinx/optimizer/statistics.rst (outdated, resolved)
presto-docs/src/main/sphinx/sql/show-stats.rst (outdated, resolved)
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from d041947 to eae161e Compare October 4, 2024 21:12
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from eae161e to ee5e3fd Compare October 8, 2024 23:28
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch 2 times, most recently from 4b9cf64 to dfe82f6 Compare October 14, 2024 21:05
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch 2 times, most recently from 16b4c7f to 4d3b2e9 Compare October 22, 2024 18:08
hantangwangd previously approved these changes Nov 3, 2024
Member

@hantangwangd hantangwangd left a comment

Thanks for all the work. The change looks good to me.

Contributor

@tdcmeehan tdcmeehan left a comment

Just some whitespace nits

presto-iceberg/pom.xml (outdated, resolved)
@@ -605,7 +602,7 @@
<dependency>
<groupId>org.apache.iceberg</groupId>
<artifactId>iceberg-core</artifactId>
-<version>1.5.0</version>
+<version>${dep.iceberg.version}</version>
Contributor

+1

The sets of ColumnStatisticsMetadata defined by the Hive
and non-Hive connectors are not equivalent. However, it is possible
to collect the superset of the relevant metadata and use it for ANALYZE.
The returned statistics then only need to be filtered down to
the relevant column statistics.

This may include duplicate calculations for some statistics. For
example, for distinct values, Iceberg Puffin files can store the
result of sketch_theta, but the code path for
storing the statistic in the HMS requires a direct value from
approx_distinct. Thus, ANALYZE may compute a value twice.
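
The collect-then-filter idea in this commit message can be sketched roughly as follows. This is a conceptual Python illustration only, not the connector code: the `(column, function)` keys, the helper names, and the placeholder results are hypothetical and do not correspond to Presto's ColumnStatisticsMetadata classes.

```python
def superset(*requested_sets):
    """Union of the statistics requested by every storage backend."""
    merged = set()
    for requested in requested_sets:
        merged |= set(requested)
    return merged


def filter_for(requested, computed):
    """Keep only the computed statistics that one backend asked for."""
    return {key: value for key, value in computed.items() if key in requested}


# Hypothetical requests, keyed by (column, aggregation function).
hms_wants = {("price", "approx_distinct"), ("price", "max")}
iceberg_wants = {("price", "sketch_theta"), ("price", "sketch_kll")}

# ANALYZE would compute everything in the superset in a single pass.
to_collect = superset(hms_wants, iceberg_wants)
computed = {key: f"<result of {key[1]}({key[0]})>" for key in to_collect}

# Each storage path then sees only the statistics it understands.
hms_stats = filter_for(hms_wants, computed)
iceberg_stats = filter_for(iceberg_wants, computed)
```

Note that `approx_distinct` and `sketch_theta` both describe distinct values for the same column here, which mirrors the duplicate computation the commit message mentions.
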

Uses the sketch_kll function to generate histograms and store them
in the Iceberg table's Puffin files for table-level statistic storage.

Histograms are always collected by ANALYZE, but they are not used by the
cost calculator unless enabled via optimizer.use-histograms.
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from 0339ee0 to 758f891 Compare December 13, 2024 23:55
@tdcmeehan tdcmeehan dismissed stale reviews from elharo and ScrapCodes December 16, 2024 19:37

Stale

@tdcmeehan tdcmeehan merged commit 05bc56c into prestodb:master Dec 16, 2024
58 checks passed
7 participants