RFC-0005. Implementation Scalar function stats propagation, Phase 1 #23545

Open
ScrapCodes wants to merge 1 commit into master from RFC_5_UDF_STATS_PHASE_1

@ScrapCodes ScrapCodes commented Aug 29, 2024

Description

  1. Add support for annotating functions with both constant stats and propagated source stats.
  2. Add tests for the annotations.
  3. Add scalar function stats calculation based on the annotations, with tests.
  4. Introduce a feature flag and a session flag.

Support for SQLInvokedScalarFunctions is not included.
Built-in functions are not annotated yet; that is covered in the next implementation phase. C++ changes are not included, as this phase covers only the Java side.
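
As a rough illustration of what "annotating functions with constant stats and propagating source stats" could look like, here is a minimal, self-contained sketch. The annotation types, their fields, and the enum values below are simplified stand-ins invented for this example (all names ending in `Sketch` are hypothetical); they are not the definitions added by this PR, which should be checked in the diff and the RFC.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical stand-in: constant output stats declared on a function. NaN means "unknown".
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface ConstantStatsSketch {
    double nullFraction() default Double.NaN;
    double distinctValuesCount() default Double.NaN;
    double avgRowSize() default Double.NaN;
}

// Hypothetical stand-in: how an output stat is derived from argument stats.
enum PropagationSketch { UNKNOWN, USE_SOURCE_STATS, MAX_OF_ARGUMENTS, SUM_OF_ARGUMENTS }

// Hypothetical stand-in: per-argument propagation hint.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.PARAMETER)
@interface PropagateSourceStatsSketch {
    PropagationSketch distinctValuesCount() default PropagationSketch.UNKNOWN;
}

final class AnnotatedFunctionsSketch {
    // Output stats are constants regardless of the input: never null, two distinct values.
    @ConstantStatsSketch(nullFraction = 0.0, distinctValuesCount = 2.0)
    static boolean isEven(long value) {
        return value % 2 == 0;
    }

    // Output distinct-values count follows the argument's distinct-values count.
    static long negate(
            @PropagateSourceStatsSketch(distinctValuesCount = PropagationSketch.USE_SOURCE_STATS) long value) {
        return -value;
    }

    // Reads the constant-stats annotation of a function, as a stats calculator might.
    static ConstantStatsSketch constantStatsOf(String name, Class<?>... parameterTypes) {
        try {
            Method method = AnnotatedFunctionsSketch.class.getDeclaredMethod(name, parameterTypes);
            return method.getAnnotation(ConstantStatsSketch.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads the propagation hint on the first argument; UNKNOWN when no hint is present.
    static PropagationSketch firstArgumentBehaviorOf(String name, Class<?>... parameterTypes) {
        try {
            Method method = AnnotatedFunctionsSketch.class.getDeclaredMethod(name, parameterTypes);
            PropagateSourceStatsSketch hint =
                    method.getParameters()[0].getAnnotation(PropagateSourceStatsSketch.class);
            return hint == null ? PropagationSketch.UNKNOWN : hint.distinctValuesCount();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The point of the sketch is only the division of labour: constant stats live on the function, propagation hints live on the arguments, and anything left unspecified stays unknown.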

Motivation and Context

https://github.com/prestodb/rfcs/blob/main/RFC-0005-functions-stats.md

Impact

None, unless the user enables the feature through the session property or the configuration property.
This PR introduces a new session property, scalar_function_stats_propagation_enabled, and a new configuration property, optimizer.scalar-function-stats-propagation-enabled. Setting either one turns the feature on or off.
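
A minimal sketch of how the two flags could interact, assuming the usual precedence in which a session property overrides the server-wide configuration value, and assuming the feature is off when neither is set (consistent with the Impact statement above). The class and method names are hypothetical, and plain maps stand in for the real session and configuration objects.

```java
import java.util.Map;

final class FeatureFlagSketch {
    static final String SESSION_PROPERTY = "scalar_function_stats_propagation_enabled";
    static final String CONFIG_PROPERTY = "optimizer.scalar-function-stats-propagation-enabled";

    // Assumed precedence: session value if present, else config value, else disabled.
    static boolean isEnabled(Map<String, String> sessionProperties, Map<String, String> config) {
        String sessionValue = sessionProperties.get(SESSION_PROPERTY);
        if (sessionValue != null) {
            return Boolean.parseBoolean(sessionValue);
        }
        return Boolean.parseBoolean(config.getOrDefault(CONFIG_PROPERTY, "false"));
    }
}
```

Under these assumptions, a user can opt a single session in or out without changing the cluster-wide setting.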

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add configuration property ``optimizer.scalar-function-stats-propagation-enabled`` and session property ``scalar_function_stats_propagation_enabled`` to enable stats propagation by annotation, supporting `RFC-5 <https://github.com/prestodb/rfcs/blob/main/RFC-0005-functions-stats.md>`_. :pr:`23545`

@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from 1504ced to c358d95 Compare August 29, 2024 12:09
@ScrapCodes ScrapCodes marked this pull request as ready for review August 29, 2024 13:11
@ScrapCodes ScrapCodes requested a review from presto-oss August 29, 2024 13:11
@steveburnett
Contributor

== RELEASE NOTES ==

General Changes
* Add configuration property ``optimizer.scalar-function-stats-propagation-enabled`` and session property ``scalar_function_stats_propagation_enabled`` to enable stats propagation by annotation, supporting `RFC-5 <https://github.com/prestodb/rfcs/blob/main/RFC-0005-functions-stats.md>`_. :pr:`23545`

@aaneja aaneja self-requested a review August 29, 2024 16:03
@ScrapCodes ScrapCodes changed the title Phase 1. Implementation for RFC-0005: Scalar function stats propagation. RFC-0005. Implementation Scalar function stats propagation, Phase 1 Aug 30, 2024
@ScrapCodes
Author

Thank you for the comment, @steveburnett. Do you have a suggestion for documenting this feature in full detail, for example in the developer docs?

@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch 2 times, most recently from de5982e to 49a20e0 Compare August 30, 2024 08:59
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from 49a20e0 to 495734f Compare September 6, 2024 11:14
@ScrapCodes
Author

Hi @ZacBlanco, can you please take a look?

@steveburnett
Contributor

> Thank you for the comment @steveburnett , do you have a suggestion on documenting this feature in full detail e.g. developer docs.

Yes, thanks! If these properties are relevant to Presto, please add entries for them to https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/admin/properties.rst.

If these are Presto C++ (Prestissimo), please add the entries to https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/presto_cpp/features.rst#session-properties and https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/presto_cpp/properties.rst as appropriate.

@ZacBlanco ZacBlanco self-requested a review September 11, 2024 20:06
Contributor

@steveburnett steveburnett left a comment

Thank you for adding the documentation! A few nits of formatting, and a couple of revisions for consistency with the existing documentation.

4 resolved, outdated review comments on presto-docs/src/main/sphinx/admin/properties.rst
Contributor

@steveburnett steveburnett left a comment

Thanks for the quick response! Two final nits that I should have caught in my first pass, and I think this should be everything.

2 resolved, outdated review comments on presto-docs/src/main/sphinx/admin/properties.rst
steveburnett previously approved these changes Sep 19, 2024
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, reviewed new local doc build, looks good. Thanks!

@ScrapCodes ScrapCodes marked this pull request as draft September 20, 2024 02:15
@ScrapCodes ScrapCodes marked this pull request as ready for review September 23, 2024 11:15
Contributor

@steveburnett steveburnett left a comment

One nit of formatting.

1 resolved, outdated review comment on presto-docs/src/main/sphinx/admin/properties.rst
steveburnett previously approved these changes Sep 23, 2024
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, new local doc build, looks good. Thanks!

@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch 2 times, most recently from 9e1a403 to ad6e77d Compare October 3, 2024 10:12
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from ad6e77d to a9f04a8 Compare October 4, 2024 10:07
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from a9f04a8 to 99302d3 Compare October 10, 2024 08:34
Contributor

@ZacBlanco ZacBlanco left a comment

Just a few minor things, otherwise lgtm

@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from 99302d3 to 9d77026 Compare October 15, 2024 10:03
@ScrapCodes ScrapCodes requested a review from ZacBlanco October 16, 2024 00:23
@ScrapCodes
Author

Hi @elharo, is it good to go?

Contributor

@elharo elharo left a comment

I'm still not sold on using doubles for row counts and the like. Doubles are not just a bigger long and have round off and representation problems integers don't. If these things are really going to exceed the size of a long (which I'm not sure they do) then a BigInteger would be preferred.

ZacBlanco previously approved these changes Oct 16, 2024
@ScrapCodes
Author

> I'm still not sold on using doubles for row counts and the like. Doubles are not just a bigger long and have round off and representation problems integers don't. If these things are really going to exceed the size of a long (which I'm not sure they do) then a BigInteger would be preferred.

BigInteger does not exist in Java natively. Maybe this is a decision that the Presto project made well before this PR came in. Would it be a good idea to start a mailing list thread with a broader audience?

@ScrapCodes ScrapCodes requested a review from elharo October 17, 2024 05:05
@elharo
Contributor

elharo commented Oct 17, 2024

BigInteger isn't a primitive type, but java.math.BigInteger is available if you need it. Again, I'm not convinced you do. long should work here.
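
To make the round-off concern in this exchange concrete, the following small sketch checks whether a given long count is exactly representable as a double. It relies only on standard Java behaviour: a double has a 53-bit significand, so integers above 2^53 are not all representable. The class and method names are made up for this illustration and are not part of the PR.

```java
import java.math.BigDecimal;

final class RowCountPrecisionSketch {
    // Returns true when converting the count to double loses no information.
    // new BigDecimal(double) yields the exact decimal value of the double,
    // so the comparison is not affected by a lossy cast back to long.
    static boolean isExactAsDouble(long count) {
        return new BigDecimal((double) count).compareTo(BigDecimal.valueOf(count)) == 0;
    }
}
```

With this check, `isExactAsDouble(1L << 53)` is true while `isExactAsDouble((1L << 53) + 1)` is false. On the other side of the trade-off, a double can carry NaN as an "unknown" marker, which the commit message in this PR relies on, whereas a long would need a separate sentinel or wrapper.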

ZacBlanco previously approved these changes Nov 8, 2024
@ScrapCodes ScrapCodes dismissed stale reviews from aaneja and ZacBlanco via 40b6f47 November 20, 2024 07:07
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from e248c01 to 40b6f47 Compare November 20, 2024 07:07
@ScrapCodes
Author

Rebased with master

@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from 9551c16 to 7693879 Compare December 13, 2024 09:09
1. Add support for annotating functions with both constant stats and propagated source stats.
2. Add tests for the annotations.
3. Add scalar function stats calculation based on the annotations, with tests.

Support for SQLInvokedScalarFunctions is not included.
Built-in functions are not annotated yet; that is covered in the next implementation phase.
C++ changes are not included, as this phase covers only the Java side.

Added documentation for the new properties and ...
 1. Previously, if any of the source stats were missing, we would still compute the max/min/sum of the argument stats.
    Now we propagate NaN if the stats for any one of the arguments are missing.

 2. For distinct values count, upper bounding it to the row count is as good as unknown. Therefore, when distinctValuesCount is provided via annotation and is greater than the row count, we set it to unknown.
    The function developer has full control here: for example, the developer can choose whether or not to upper bound by selecting the appropriate StatsPropagationBehavior value.

 3. For average row size:
    a) If the average row size is provided via the ScalarFunctionConstantStats annotation, we allow it even if the size is greater than the function's return type width.
    b) If the average row size is provided via one of the StatsPropagationBehavior values, we upper bound it to the function's return type width, if available.
    If both (a) and (b) are unknown, we default to the function's return type width, if available.

This way the function developer has greater control.
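
The three rules above can be sketched as follows. This is an illustrative reading of the commit message, not the code in the PR: the class and method names are hypothetical, NaN stands for "unknown", and the choice to let a constant average row size take precedence over a propagated one is an assumption made for the example.

```java
final class StatsRulesSketch {
    // Rule 1: if any argument statistic is missing (NaN), the result is unknown (NaN).
    static double maxOfArguments(double... argumentStats) {
        if (argumentStats.length == 0) {
            return Double.NaN;
        }
        double result = Double.NEGATIVE_INFINITY;
        for (double stat : argumentStats) {
            if (Double.isNaN(stat)) {
                return Double.NaN;
            }
            result = Math.max(result, stat);
        }
        return result;
    }

    // Rule 2: an annotation-provided distinct values count above the row count becomes unknown.
    static double distinctValuesFromAnnotation(double annotatedCount, double rowCount) {
        return annotatedCount > rowCount ? Double.NaN : annotatedCount;
    }

    // Rule 3: average row size.
    //   (a) a constant from the annotation is used as-is, even above the return type width;
    //   (b) a propagated value is capped at the return type width when the width is known;
    //   otherwise fall back to the return type width (NaN when that is unknown too).
    static double averageRowSize(double constantSize, double propagatedSize, double returnTypeWidth) {
        if (!Double.isNaN(constantSize)) {
            return constantSize; // assumed precedence: constant over propagated
        }
        if (!Double.isNaN(propagatedSize)) {
            return Double.isNaN(returnTypeWidth)
                    ? propagatedSize
                    : Math.min(propagatedSize, returnTypeWidth);
        }
        return returnTypeWidth;
    }
}
```

For example, under these assumptions `averageRowSize(16, Double.NaN, 8)` keeps the annotated 16, while `averageRowSize(Double.NaN, 16, 8)` is capped to 8.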

Added a new behaviour, SUM_ARGUMENTS_UPPER_BOUND_ROW_COUNT, which upper bounds the value to the row count, so that the sum of the arguments' distinct values counts does not exceed the row count.
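
A minimal sketch of what a sum-with-upper-bound behaviour of this kind could compute. The class and method names are hypothetical, NaN again means "unknown", and returning the uncapped sum when the row count itself is unknown is an assumption of this example rather than something stated in the PR.

```java
final class SumUpperBoundSketch {
    // Sums the arguments' distinct values counts and caps the sum at the row count.
    // Any unknown (NaN) argument makes the result unknown.
    static double sumArgumentsUpperBoundRowCount(double rowCount, double... argumentDistinctCounts) {
        double sum = 0.0;
        for (double count : argumentDistinctCounts) {
            if (Double.isNaN(count)) {
                return Double.NaN;
            }
            sum += count;
        }
        // Assumption: with an unknown row count there is nothing to cap against.
        return Double.isNaN(rowCount) ? sum : Math.min(sum, rowCount);
    }
}
```

With a row count of 100, argument counts of 60 and 70 give 100 rather than 130, while counts of 10 and 20 give 30.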
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from c334fdb to 07d2d0c Compare December 28, 2024 17:05
5 participants