Support interval multiplication and division by arbitrary numerics #6906

milevin · 2024-12-19T08:55:53Z

Which issue does this PR close?

Rationale for this change

This adds support for interval multiplication and division. The end goal is to enable this in DataFusion to make it on par with Postgres in this regard.

arrow-arith/src/numeric.rs

findepi · 2024-12-19T08:59:36Z

arrow-arith/src/numeric.rs

+        if right == 0 {
+            return Err(ArrowError::DivideByZero);
+        }
+        Ok((left as f64 / right as f64).round() as i32)


why in floats?

To avoid premature truncation in case this is an i32?

for division, would left / right as Self or something like that work?
ie remaining in integer arithemtics. for example for math like 10000000000000000000000000000000000001 / 1 going thru f64 will change the value.

arrow-arith/src/numeric.rs

tustvold · 2024-12-19T11:47:40Z

Might I suggest we just add support for f64 and i32 to start off with, this would drastically cut down on codegen, and make the code easier to parse. DF's type coercion machinery should be able to handle this transparently for users. In practice people will be performing these operations with scalars, and so this type coercion is effectively free.

As an aside, coercing decimals to floats will yield incorrect answers

findepi · 2024-12-19T14:20:14Z

As an aside, coercing decimals to floats will yield incorrect answers

that's exactly what would happen if interval * decimal is not implemented directly.
Is this argument to support f64, i64 and decimal here? (and leave out smaller integer and floating point types)

findepi · 2024-12-19T14:21:12Z

arrow-arith/src/numeric.rs

+    if !value.is_finite() || value > i32::MAX as f64 || value < i32::MIN as f64 {
+        Err(ArrowError::ComputeError(
+            "Division result out of i32 range".to_string(),
+        ))
+    } else {
+        Ok(value as i32)
+    }


is there a builtin faillible version of float_value as i32 that we could use instead of implementing the logic ourselves?

If this is approached in the way I described above you won't be converting floats to integers. When you get a floating point array representing the number of milliseconds and you need to convert that to IntervalDayTimeType you will be dividing the input to MILLIS_IN_A_DAY and round the rest which comfortably fits in a 32-bit integer.

tustvold · 2024-12-19T14:35:24Z

Is this argument to support f64, i64 and decimal here? (and leave out smaller integer and floating point types)

I think we should just support f64, i64 for now. This is all that is actually implemented anyway by this PR, we're just doing implicit coercion.

that's exactly what would happen if interval * decimal is not implemented directly.

Right, my point is this PR is not implementing interval * decimal it is implementing interval * float64(decimal) masquerading as interval * decimal. Such lossy conversion IMO should be opt-in and explicit.

I'd recommend we leave out decimal support for now, and if people want it we can add support as a separate PR (it will be non-trivial).

milevin · 2024-12-19T16:55:28Z

Might I suggest we just add support for f64 and i32 to start off with, this would drastically cut down on codegen, and make the code easier to parse.

Sounds good, I'll make the change! Could you please clarify the comment about codegen? At which point is the codegen invoked?

DF's type coercion machinery should be able to handle this transparently for users. In practice people will be performing these operations with scalars, and so this type coercion is effectively free.

You mean coercion from, say, f32 to f64? What are you referring to in this sentence?

tustvold · 2024-12-19T16:59:35Z

Could you please clarify the comment about codegen? At which point is the codegen invoked?

Rust relies on monomorophisation of generics, i.e. each distinct parametrisation of a generic generates a "copy" of that code. This can quickly add up and bloat binary size and compile times. See https://rustc-dev-guide.rust-lang.org/backend/monomorph.html

You mean coercion from, say, f32 to f64? What are you referring to in this sentence?

DataFusion has existing logic to coerce inputs to those compatible with a function signature, if we provide the ability to multiply interval * i64, DataFusion has existing machinery to generalize this to interval * (type coercable to i64), including from things like strings.

milevin · 2024-12-19T18:10:48Z

Note, in the last commit, I've implemented i32 (but not i64) per:

Might I suggest we just add support for f64 and i32 to start off with

felipecrv

My suggestions require a whole different approach, so you might wait for @tustvold to confirm this is the right direction. This is my first arrow-rs review, but this is exactly how I would approach in Arrow C++ with C++ templates that blow up in binary size similarly to Rust traits.

felipecrv · 2024-12-20T18:20:52Z

arrow-arith/src/numeric.rs

+        (Interval(unit), rhs) if rhs.is_numeric() && matches!(op, Op::Mul | Op::MulWrapping) =>
+            match unit {
+                YearMonth => interval_mul_op::<IntervalYearMonthType>(op, l, l_scalar, r, r_scalar),
+                DayTime => interval_mul_op::<IntervalDayTimeType>(op, l, l_scalar, r, r_scalar),
+                MonthDayNano => interval_mul_op::<IntervalMonthDayNanoType>(op, l, l_scalar, r, r_scalar),
+            },
+        (lhs, Interval(unit)) if lhs.is_numeric() && matches!(op, Op::Mul | Op::MulWrapping) =>
+            match unit {
+                YearMonth => interval_mul_op::<IntervalYearMonthType>(op, r, r_scalar, l, l_scalar),
+                DayTime => interval_mul_op::<IntervalDayTimeType>(op, r, r_scalar, l, l_scalar),
+                MonthDayNano => interval_mul_op::<IntervalMonthDayNanoType>(op, r, r_scalar, l, l_scalar),
+            },
+        (Interval(unit), rhs) if rhs.is_numeric() && matches!(op, Op::Div) =>
+            match unit {
+                YearMonth => interval_div_op::<IntervalYearMonthType>(op, l, l_scalar, r, r_scalar),
+                DayTime => interval_div_op::<IntervalDayTimeType>(op, l, l_scalar, r, r_scalar),
+                MonthDayNano => interval_div_op::<IntervalMonthDayNanoType>(op, l, l_scalar, r, r_scalar),
+            },


I think it makes more sense to keep these patterns concerned only with the lhs/rhs types and leaving the switch on op to interval_op instead of adding interval_mul_op/interval_div_op. Since the preparation to dispatching a mul or a div are similar, keeping it all inside interval_op and switching on op may reduce binary size.

felipecrv · 2024-12-20T19:05:10Z

arrow-arith/src/numeric.rs

+    fn mul_int(left: Self::Native, right: i32) -> Result<Self::Native, ArrowError>;
+    fn mul_float(left: Self::Native, right: f64) -> Result<Self::Native, ArrowError>;
+    fn div_int(left: Self::Native, right: i32) -> Result<Self::Native, ArrowError>;
+    fn div_float(left: Self::Native, right: f64) -> Result<Self::Native, ArrowError>;


Instead of instantiating a single-value interval [/*] number operation to build all the array operations we want, we can approach this in a way that avoids the combinatorial explosion.

To implement interval_array [/*] number operations we need operations that convert intervals to integers that count the number of the smallest unit of the interval type, regular int and float multiplication/divisions (which already exist), and then the conversion back to interval types. These conversion operations are parameterized by a single type so the number of specializations isn't a product of the number of interval types and the number of int and float types. All these operations are at the array level and not at the single-value level.

IntervalYearMonthType is already an int32 array (number of months) [1], so just fallback to int32 x ... kernels and convert the result to int32 number of months again (conversion might not even be needed depending on what rhs is)

IntervalDayTimeType is days and milliseconds (both int32) so if the whole array is converted to an int64 array of milliseconds you can delegate to regular * / and convert back the result

IntervalMonthDayNanoType similar idea.

[1] https://github.com/apache/arrow/blob/02a165922e46e5ed6dd3ed2446141cd0922a7c54/format/Schema.fbs#L398

Unfortunately intervals cannot be coerced to a duration as proposed, doing so changes their semantic meaning, 2 months is not the same as 60 days nor is 48 hours the same as 2 days (because of daylight savings). This has implications when performing arithmetic involving them

Unfortunately intervals cannot be coerced to a duration as proposed...

For the purpose of multiplication/division, some of the interval types can (internally to the kernel) -- before returning the output array, they are converted back to interval type to preserve their semantic meaning.

YEAR_MONTH

Example: the double of 1 year and 1 month is 2 years and 2 months which is what you get if you convert (1y, 1m) to 13m then 2x to 26m which should then become (2y, 2m). This is all fine because every year, no matter what, has 12 months. So 12 months can become a year in the interval. This is why the underlying implementation itself is in number of months.

Daylight savings and calendar consideration only come into play when you add an interval to a specific timestamp/date which is not the case here. It would be wrong to add 1y+1m to 2024-12-20 as just adding the number of seconds that interval in 30-day months converts to, but the operation we are doing here is the scaling of the interval by some factor while preserving the semantic meaning.

But maybe the multiplication by a float should not be available. Any system that wants to support interval[YEAR_MONTH]*float multiplication should round the float first or convert the interval to an interval type with more resolution.

DAY_TIME

IntervalDayTimeType expects the millis part to be less than a whole day worth of milliseconds (i.e. leap seconds are not accounted for) [1][2]. Perhaps the output of scaling operations on them (mul/div) should be a MONTH_DAY_NANO interval. Because unlike years that always have 12 months, days don't always have the same number of millis in them.

MONTH_DAY_NANO

For MonthDayNano my approach breaks indeed -- the components should be scaled independently. And the number of nanos is unbounded -- it doesn't have to be less than a day's worth of nanos.

Multiplication by float and division gets really confusing though.

[1] https://github.com/apache/arrow/blob/02a165922e46e5ed6dd3ed2446141cd0922a7c54/format/Schema.fbs#L398
[2] apache/arrow@7f7f72d

@bkietz what are your thoughts on multiplication and division of interval types? I remember you being involved in some discussion regarding these types in the past.

they are converted back to interval type to preserve their semantic meaning

Right but this conversion is not possible for anything other than YearMonth (which is just a number of moths IIRC).

For anything else this conversion is lossy as days and months do not correspond to a fixed duration.

some of the interval types can (internally to the kernel)

If by some you mean one 😅

The main point of my suggestion is to think of these operations in terms of whole-array operations instead of single-value operations.

To treat the interval components separately, we would only need functions that split an array of interval values into multiple arrays (each per component) so that all the arithmetic can be delegated to existing arithmetic operations for all int/float combinations.

It's a sound idea in principle, and this sort of approach to avoiding codegen is something I would generally advocate for, however, I don't think it works in this case.

In particular we don't support arithmetic operations between heterogeneous primitive types, e.g. int32 * float64, instead leaving such coercion to to query planners which are better placed to do this. The result is that we would effectively need the query engines to do this explosion for us, in order for things like interval * f64 to work.

Even discounting this, the performance characteristics of such an approach would be unfortunate, at that point you'd be better off using a StructArray

felipecrv · 2024-12-20T19:08:16Z

arrow-arith/src/numeric.rs

+    if !value.is_finite() || value > i32::MAX as f64 || value < i32::MIN as f64 {
+        Err(ArrowError::ComputeError(
+            "Division result out of i32 range".to_string(),
+        ))
+    } else {
+        Ok(value as i32)
+    }


If this is approached in the way I described above you won't be converting floats to integers. When you get a floating point array representing the number of milliseconds and you need to convert that to IntervalDayTimeType you will be dividing the input to MILLIS_IN_A_DAY and round the rest which comfortably fits in a 32-bit integer.

felipecrv · 2024-12-20T19:17:50Z

arrow-arith/src/numeric.rs

+/// Perform multiplication between an interval array and a numeric array
+fn interval_mul_op<T: IntervalOp>(
+    op: Op,
+    l: &dyn Array,
+    l_s: bool,
+    r: &dyn Array,
+    r_s: bool,
+) -> Result<ArrayRef, ArrowError> {


As I said in the first comment, these should become different cases in

fn interval_op<T: IntervalOp>( op: Op, l: &dyn Array, l_s: bool, r: &dyn Array, r_s: bool, ) -> Result<ArrayRef, ArrowError> { let l = l.as_primitive::<T>(); let r = r.as_primitive::<T>(); match op { Op::Add | Op::AddWrapping => Ok(try_op_ref!(T, l, l_s, r, r_s, T::add(l, r))), Op::Sub | Op::SubWrapping => Ok(try_op_ref!(T, l, l_s, r, r_s, T::sub(l, r))), -- NEW CASES HERE --

And then instead of relying on try_op_ref! that produces array operations based on single-value functions, you will first convert the interval array to a integer array (for both mul and div so no need for different traits), then delegate to either mul of div of numeric inputs (no interval types involved at this point), then you take that output and convert the desired interval type with the appropriate kernel. These conversion functions might be defined with try_op_ref!.

felipecrv · 2024-12-20T19:22:40Z

You mean coercion from, say, f32 to f64? What are you referring to in this sentence?

DataFusion has existing logic to coerce inputs to those compatible with a function signature, if we provide the ability to multiply interval * i64, DataFusion has existing machinery to generalize this to interval * (type coercable to i64), including from things like strings.

The approach I suggested doesn't rely on these Datafusion casts and doesn't increase binary size. It's a win-win.

Decimals can be covered by truncating the resulting decimals to integers [1] (e.g. number of millis or nanos) and then converting those int arrays back to the interval type.

[1] Kernels that might already exist or should exist (taking the integer part of a whole decimal array)

tustvold · 2024-12-20T19:55:43Z

The approach I suggested doesn't rely on these Datafusion casts and doesn't increase binary size. It's a win-win.

Unfortunately it is incorrect to treat intervals in this way - #6906 (comment)

I will try to find some time to re-review this PR over the next few days

milevin · 2024-12-20T22:15:57Z

The approach I suggested doesn't rely on these Datafusion casts and doesn't increase binary size. It's a win-win.

Unfortunately it is incorrect to treat intervals in this way - #6906 (comment)

I will try to find some time to re-review this PR over the next few days

Thanks a lot to both of you for the feedback!

I'll wait for @tustvold's additional review in the next few days and proceed based on the combination of recommendations.

Support interval multiplication and division by arbitrary numerics

255cf36

github-actions bot added the arrow Changes to the arrow crate label Dec 19, 2024

findepi reviewed Dec 19, 2024

View reviewed changes

arrow-arith/src/numeric.rs Outdated Show resolved Hide resolved

findepi reviewed Dec 19, 2024

View reviewed changes

arrow-arith/src/numeric.rs Outdated Show resolved Hide resolved

findepi reviewed Dec 19, 2024

View reviewed changes

arrow-arith/src/numeric.rs Outdated Show resolved Hide resolved

Make division safer; address PR feedback

be54b4a

findepi reviewed Dec 19, 2024

View reviewed changes

Simplifying to support only i32 and f64

4fc24a5

milevin added 2 commits December 19, 2024 10:24

Tidying up float - int switching

4f7628c

Fixing clippy -- useless conversion

6030e82

alamb mentioned this pull request Dec 19, 2024

Support multiply and divide on intervals apache/datafusion#13850

Open

2 tasks

felipecrv requested changes Dec 20, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support interval multiplication and division by arbitrary numerics #6906

Support interval multiplication and division by arbitrary numerics #6906

milevin commented Dec 19, 2024

findepi Dec 19, 2024

milevin Dec 19, 2024

findepi Dec 19, 2024

tustvold commented Dec 19, 2024 •

edited

Loading

findepi commented Dec 19, 2024

findepi Dec 19, 2024

felipecrv Dec 20, 2024

tustvold commented Dec 19, 2024

milevin commented Dec 19, 2024

tustvold commented Dec 19, 2024

milevin commented Dec 19, 2024 •

edited

Loading

felipecrv left a comment

felipecrv Dec 20, 2024

felipecrv Dec 20, 2024

tustvold Dec 20, 2024

felipecrv Dec 20, 2024

felipecrv Dec 20, 2024

tustvold Dec 20, 2024

felipecrv Dec 20, 2024

tustvold Dec 20, 2024

felipecrv Dec 20, 2024

felipecrv Dec 20, 2024

felipecrv commented Dec 20, 2024 •

edited

Loading

tustvold commented Dec 20, 2024 •

edited

Loading

milevin commented Dec 20, 2024

Support interval multiplication and division by arbitrary numerics #6906

Are you sure you want to change the base?

Support interval multiplication and division by arbitrary numerics #6906

Conversation

milevin commented Dec 19, 2024

Which issue does this PR close?

Rationale for this change

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold commented Dec 19, 2024 • edited Loading

findepi commented Dec 19, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold commented Dec 19, 2024

milevin commented Dec 19, 2024

tustvold commented Dec 19, 2024

milevin commented Dec 19, 2024 • edited Loading

felipecrv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

YEAR_MONTH

DAY_TIME

MONTH_DAY_NANO

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

felipecrv commented Dec 20, 2024 • edited Loading

tustvold commented Dec 20, 2024 • edited Loading

milevin commented Dec 20, 2024

tustvold commented Dec 19, 2024 •

edited

Loading

milevin commented Dec 19, 2024 •

edited

Loading

felipecrv commented Dec 20, 2024 •

edited

Loading

tustvold commented Dec 20, 2024 •

edited

Loading