Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ColumnStatistics::Sum #14074

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Conversation

gatesn
Copy link

@gatesn gatesn commented Jan 10, 2025

Which issue does this PR close?

This PR adds a sum statistic to DataFusion.

Future use will include optimizing aggregation functions (sum, avg, count), see https://github.com/apache/datafusion/pull/13736/files for examples.

Are there any user-facing changes?

The ColumnStatistics struct has an extra public sum_value field.

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate common Related to common crate proto Related to proto crate labels Jan 10, 2025
@alamb alamb changed the title Add a sum statistic Add a ColumnStatistics::Sum Jan 12, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @gatesn -- I think this is a nice addition.

It looks like the cargo fmt test is failing

Ideally we would add unit test coverage for Precision::multiply Precision::sub and Precision::cast_to before we merged.

Thanks again -- excited to see this working

FYI @suremarc @berkaysynnada / @ozankabak as this changes statistics and I think you are already working on things related to that:

@@ -436,6 +492,8 @@ pub struct ColumnStatistics {
pub max_value: Precision<ScalarValue>,
/// Minimum value of column
pub min_value: Precision<ScalarValue>,
/// Sum value of a column
pub sum_value: Precision<ScalarValue>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I think we mentioned in #13736 my only real concern with this addition is that it will make ColumnStatistics even bigger (each ScalarValue is quite large already and ColumnStatistics are copied a bunch

However, I think the "right" fix for that is to move to using a different statistics representation (e.g. Arc::ColumnStatistics) so I don't see this as a blocker

datafusion/common/src/stats.rs Show resolved Hide resolved
@alamb alamb changed the title Add a ColumnStatistics::Sum Add ColumnStatistics::Sum Jan 12, 2025
(_, _) => Precision::Absent,
}
}

/// Casts the value to the given data type, propagating exactness information.
pub fn cast_to(&self, data_type: &DataType) -> Result<Precision<ScalarValue>> {
Copy link
Author

@gatesn gatesn Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb one question I have is whether this should return a Result, or we should assume that a failed cast implies overflow and therefore return Precision::Absent?

The caller (currently in cross-join) unwraps to Absent, I just didn't know whether to internalize that here.

Edit: I decided it was better to propagate the error and allow the caller to decide. It was more useful in a couple of places.

Copy link
Author

@gatesn gatesn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some tests and hopefully appeases the linter!

@berkaysynnada
Copy link
Contributor

FYI @suremarc @berkaysynnada / @ozankabak as this changes statistics and I think you are already working on things related to that:

We've started to refactor. The design is complete, and the implementation is in progress.

I’ve taken a look at this and have some questions. For example, are we planning to add many types of functions to statistics, or is there a defined list of statistics that can be inferred from the sources or have meaningful applications in optimizer rules? If we agree that these kinds of extensions to column statistics are indeed useful and obtainable, then we can proceed with merging this. We would also ensure it is included in the new setup.

FYI @ozankabak

@alamb
Copy link
Contributor

alamb commented Jan 13, 2025

We've started to refactor. The design is complete, and the implementation is in progress.

Thanks! Is there anywhere I can follow along @berkaysynnada (I am particularly interested in what the final API / representation looks like)

@berkaysynnada
Copy link
Contributor

We've started to refactor. The design is complete, and the implementation is in progress.

Thanks! Is there anywhere I can follow along @berkaysynnada (I am particularly interested in what the final API / representation looks like)

I've reached you via discord

@alamb
Copy link
Contributor

alamb commented Jan 14, 2025

We've started to refactor. The design is complete, and the implementation is in progress.

Thanks! Is there anywhere I can follow along @berkaysynnada (I am particularly interested in what the final API / representation looks like)

I've reached you via discord

For anyone else who is interested, the draft PR in the synnada fork is here:

@gatesn
Copy link
Author

gatesn commented Jan 15, 2025

Looks like I got hit by some new ColumnStatistics tests on main. Should be fixed now 🤞

@berkaysynnada can you expand on the rationale for the V2 stats? I understand that it's more expressive, but I can't see in the PR or Notion how those distributions might actually be used? Is this for join planning?

My understanding is I would no longer define a "min" or a "max" for a column. But there doesn't seem to be a place to define null count or sum?

@berkaysynnada
Copy link
Contributor

Looks like I got hit by some new ColumnStatistics tests on main. Should be fixed now 🤞

@berkaysynnada can you expand on the rationale for the V2 stats? I understand that it's more expressive, but I can't see in the PR or Notion how those distributions might actually be used? Is this for join planning?

My understanding is I would no longer define a "min" or a "max" for a column. But there doesn't seem to be a place to define null count or sum?

You can still define min or max. We are not replacing Statistics with Statistics_v2; it is actually replacing the Precision and Interval objects. We plan to rename the API of the execution plan from fn statistics(&self) -> Statistics to fn statistics(&self) -> TableStatistics, which is still structured as:

pub struct TableStatistics {
    pub num_rows: Statistics,
    pub total_byte_size: Statistics,
    pub column_statistics: Vec<ColumnStatistics>,
}

and

pub struct ColumnStatistics {
    pub null_count: Statistics,
    pub max_value: Statistics,
    pub min_value: Statistics,
    pub distinct_count: Statistics,
}

What we are trying to address is how the current way of indeterminate quantities are handled in a target-dependent way. For example, if there’s a possibility of indeterminate statistics, it is stored as the mean value when the caller requires an estimate. However, if bounds are required, that indeterminism is stored as an interval.

Our goal is to consolidate all forms of indeterminism and structure them with a strong mathematical foundation. This way, every user can utilize the statistics in their intended way. We aim to preserve and sustain all possible helpful quantities wherever feasible.

We are also constructing a robust evaluation and back-propagation mechanism (similar to interval arithmetic, evaluate_bounds, and propagate_constraints). With this mechanism, any kind of expression—whether projection-based (evaluation only) or filter-based (evaluation followed by propagation)—will automatically resolve using the new statistics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate physical-expr Physical Expressions proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants