Structured interval types for IntervalMonthDayNano or IntervalDayTime (#3125) (#5654) #5769

Merged
merged 4 commits into from
May 20, 2024

Conversation

Contributor

@tustvold tustvold commented May 15, 2024

Which issue does this PR close?

Closes #3125
Closes #5654

Rationale for this change

The current interval implementation in Rust is incorrect (the bit fields are wrong), as explained in #5654.

Because intervals are stored as i128, simply updating how the fields are constructed and read would have at least two unpleasant effects:

  1. Downstream code that uses intervals might silently break or start behaving very differently
  2. Sorting would become somewhat nonsensical as

What changes are included in this PR?

  1. Make IntervalMonthDayNano and IntervalDayTime structured types
  2. Update IntervalMonthDayNanoArray and IntervalDayTimeArray appropriately
  3. Remove cast support between Int64Array and those types
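
To illustrate the rationale, here is a minimal, std-only sketch. `MonthDayNano` and `to_le_i128` are hypothetical stand-ins, not the arrow-rs types; the field order (months, days, nanoseconds) follows the Arrow format's month/day/nanosecond interval layout. It shows that a structured type compares field by field, while the same bytes read as one little-endian i128 order differently.

```rust
/// Hypothetical stand-in for a structured month/day/nanosecond interval.
/// Deriving `Ord` compares months first, then days, then nanoseconds.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct MonthDayNano {
    pub months: i32,
    pub days: i32,
    pub nanoseconds: i64,
}

impl MonthDayNano {
    pub const fn new(months: i32, days: i32, nanoseconds: i64) -> Self {
        Self { months, days, nanoseconds }
    }

    /// Reinterpret the 16 little-endian bytes of the struct as one i128.
    /// Months land in the low 32 bits, nanoseconds in the high 64 bits.
    pub fn to_le_i128(self) -> i128 {
        let mut bytes = [0u8; 16];
        bytes[0..4].copy_from_slice(&self.months.to_le_bytes());
        bytes[4..8].copy_from_slice(&self.days.to_le_bytes());
        bytes[8..16].copy_from_slice(&self.nanoseconds.to_le_bytes());
        i128::from_le_bytes(bytes)
    }
}
```

For example, `MonthDayNano::new(1, 0, 0)` compares greater than `MonthDayNano::new(0, 0, 1)` as a struct, yet its packed integer is the smaller of the two, so integer ordering and field ordering disagree.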

Are there any user-facing changes?

Yes, this is a major breaking change for anyone who uses IntervalMonthDayNano or IntervalDayTime

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels May 15, 2024

/// Computes the absolute value
#[inline]
pub fn wrapping_abs(self) -> Self {
Contributor Author

@tustvold tustvold May 15, 2024

These arithmetic operations are perhaps a little odd, but it's necessary to implement them in order to keep the type machinery happy
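
A rough sketch of what field-wise arithmetic on a structured type can look like. `DayTime` is a hypothetical stand-in, and applying each operation to the fields independently is an assumption made for illustration, not a statement of what this PR implements.

```rust
/// Hypothetical stand-in for a structured day/millisecond interval.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DayTime {
    pub days: i32,
    pub milliseconds: i32,
}

impl DayTime {
    pub const fn new(days: i32, milliseconds: i32) -> Self {
        Self { days, milliseconds }
    }

    /// Absolute value of each field; `i32::MIN` wraps back to itself.
    pub fn wrapping_abs(self) -> Self {
        Self {
            days: self.days.wrapping_abs(),
            milliseconds: self.milliseconds.wrapping_abs(),
        }
    }

    /// Field-wise addition that wraps on overflow instead of panicking.
    pub fn wrapping_add(self, other: Self) -> Self {
        Self {
            days: self.days.wrapping_add(other.days),
            milliseconds: self.milliseconds.wrapping_add(other.milliseconds),
        }
    }
}
```

The "odd" part is visible here: the absolute value of an interval with mixed-sign fields has no obvious calendar meaning, but generic numeric code still needs the method to exist.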

@tustvold tustvold force-pushed the structured-interval branch 2 times, most recently from f5a5127 to 30ae103 Compare May 15, 2024 18:29
}

fn as_usize(self) -> usize {
(self.months as usize) | ((self.days as usize) << 32)
Contributor Author

This is a bit contrived, but tests make use of this logic so we need to at least partially support it

@tustvold tustvold force-pushed the structured-interval branch from 30ae103 to 416958a Compare May 15, 2024 18:31
@tustvold tustvold force-pushed the structured-interval branch from efdc38f to 25bf96a Compare May 15, 2024 19:03
@tustvold tustvold added the api-change Changes to the arrow API label May 16, 2024
Contributor

@crepererum crepererum left a comment

I really like the structured types, thanks a lot.

│ (64 bits) │ (32 bits) │ (32 bits) │
└──────────────────────────────┴─────────────┴──────────────┘
0 63 95 127 bit offset
┌───────────────┬─────────────┬─────────────────────────────┐
Contributor

Remind me: do we have integration tests that would catch this issue?

Contributor Author

Yes, and they're now actually testing the logic rather than doing the mapping themselves

Contributor

In particular, the changes to arrow-integration-test/src/lib.rs in this PR

@alamb alamb changed the title Structured interval type (#3125) (#5654) Structured interval types for IntervalMonthDayNano or IntervalDayTime (#3125) (#5654) May 17, 2024
Contributor

@alamb alamb left a comment

Thank you @tustvold. I went through this PR and it looks very nice to me. It is a breaking change, but given that the existing implementation is wrong, I think this will at least make sure downstream users are aware of that.

I also filled out the PR description so that people who come here have some idea of the context and some of the rationale behind the decisions. It would be good if you could double-check that I am not misrepresenting anything.

Thank you @crepererum for the review

cc @berkaysynnada and @ozankabak and @avantgardnerio who may be interested

native_type_op!(IntervalDayTime, IntervalDayTime::ZERO, IntervalDayTime::ONE);
native_type_op!(
IntervalMonthDayNano,
IntervalMonthDayNano::ZERO,
Contributor

I wondered if this needed any documentation, but it appears the answer is it already has some nice documentation 👍

@@ -275,11 +275,6 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
DayTime => false,
MonthDayNano => false,
},
(Int64, Interval(to_type)) => match to_type {
Contributor

This removes direct casting support for Int64Array to IntervalArray (which presumably was interpreting the integers incorrectly anyways)

What should someone do who uses that cast in their code today?

Something like

let cast_array = IntervalDayTimeArray::from_iter_values(
  int64_array
    .values()
    .iter()
    .map(|&v| IntervalDayTime::new((v >> 32) as i32, v as i32)),
);

?

Contributor Author

@tustvold tustvold May 20, 2024

Or more optimally

let a: IntervalDayTimeArray = int64_array.unary(|x| IntervalDayTime::new((x >> 32) as i32, x as i32));

Although the exact specifics will depend on whether they want to emulate the old (incorrect) byte order or the new one 😅

This ambiguity is why I removed it
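
The ambiguity can be made concrete with a small std-only sketch. Both helper names are hypothetical and neither is part of arrow-rs; which one matches a given dataset depends on the layout the integers were originally written with, which is an assumption the caller has to make.

```rust
/// Hypothetical helper: treat the high 32 bits as days and the low 32 bits
/// as milliseconds. Returns `(days, milliseconds)`.
pub fn split_days_high(x: i64) -> (i32, i32) {
    ((x >> 32) as i32, x as i32)
}

/// Hypothetical helper: treat the low 32 bits as days and the high 32 bits
/// as milliseconds. Returns `(days, milliseconds)`.
pub fn split_days_low(x: i64) -> (i32, i32) {
    (x as i32, (x >> 32) as i32)
}
```

The same integer yields swapped fields under the two readings, so a built-in cast would have to pick one silently.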

Contributor

As long as there is some way to get the old behavior (if desired), I think removing the built-in API seems reasonable

// specific language governing permissions and limitations
// under the License.

macro_rules! derive_arith {
Contributor

Could you document this a bit to give future readers an idea of its purpose?
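
For readers unfamiliar with the pattern, a macro of this kind typically generates the operator trait impls for a type by forwarding to methods the type already has. The sketch below is an assumption about the general shape, not the macro in this PR; `derive_op!` and `DayTime` are hypothetical names.

```rust
use std::ops::{Add, AddAssign};

/// Hypothetical macro: implement an operator trait and its `*Assign`
/// counterpart for `$ty` by forwarding to an existing wrapping method.
macro_rules! derive_op {
    ($ty:ty, $t:ident, $t_assign:ident, $op:ident, $op_assign:ident, $wrapping:ident) => {
        impl $t for $ty {
            type Output = $ty;

            fn $op(self, rhs: $ty) -> $ty {
                self.$wrapping(rhs)
            }
        }

        impl $t_assign for $ty {
            fn $op_assign(&mut self, rhs: $ty) {
                *self = self.$wrapping(rhs);
            }
        }
    };
}

/// Hypothetical stand-in for a structured day/millisecond interval.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DayTime {
    pub days: i32,
    pub milliseconds: i32,
}

impl DayTime {
    pub const fn new(days: i32, milliseconds: i32) -> Self {
        Self { days, milliseconds }
    }

    /// Field-wise addition that wraps on overflow.
    pub fn wrapping_add(self, rhs: Self) -> Self {
        Self::new(
            self.days.wrapping_add(rhs.days),
            self.milliseconds.wrapping_add(rhs.milliseconds),
        )
    }
}

// Generates `impl Add for DayTime` and `impl AddAssign for DayTime`.
derive_op!(DayTime, Add, AddAssign, add, add_assign, wrapping_add);
```

One macro invocation per operator keeps the boilerplate in a single place, which is the main reason to reach for a macro here rather than writing each impl by hand.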

@@ -387,6 +398,25 @@ pub fn array_from_json(
1 => b.append_value(match value {
Value::Number(n) => n.as_i64().unwrap(),
Value::String(s) => s.parse().expect("Unable to parse string as i64"),
_ => panic!("Unable to parse {value:?} as number"),
Contributor

The panics look suspicious to me, but the rest of the code follows the same pattern

Contributor Author

Yeah, this code was originally written solely for tests, and so is quite panic-happy
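
For comparison, a non-panicking variant of the same match could return a `Result`. This is only a sketch under stated assumptions: `Value` below is a small hypothetical enum standing in for a JSON value type, and `parse_i64` is not a function from this PR.

```rust
/// Hypothetical, simplified stand-in for a JSON value.
#[derive(Debug)]
pub enum Value {
    Number(i64),
    String(String),
    Null,
}

/// Parse a value as i64, reporting failures as `Err` instead of panicking.
pub fn parse_i64(value: &Value) -> Result<i64, String> {
    match value {
        Value::Number(n) => Ok(*n),
        Value::String(s) => s
            .parse::<i64>()
            .map_err(|e| format!("Unable to parse string as i64: {e}")),
        other => Err(format!("Unable to parse {other:?} as number")),
    }
}
```

In test-only code the panicking form is shorter and fails loudly, which is presumably why the existing pattern was kept.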

assert_eq!(Ordering::Greater, cmp(3, 1));
assert_eq!(Ordering::Greater, cmp(3, 2));
assert_eq!(Ordering::Less, cmp(0, 0)); // v1 vs v3
assert_eq!(Ordering::Equal, cmp(0, 3)); // v1 vs v1
Contributor

It took me a while to trace through the double layer of indirection here. Looks good

@@ -163,6 +165,44 @@ impl FixedLengthEncoding for f64 {
}
}

impl FixedLengthEncoding for IntervalDayTime {
Contributor

This adds support for the new structured types in the row encoder 👍
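
As background, a fixed-length, order-preserving encoding for a two-field struct can be built by encoding each field so that byte-wise comparison matches numeric comparison and then concatenating the results. The sketch below assumes the common sign-bit-flip technique for signed integers; the function names are hypothetical and this is not the code from the PR.

```rust
/// Encode an i32 so that unsigned, byte-wise comparison of the output
/// matches signed comparison of the input: big-endian bytes with the
/// sign bit flipped.
pub fn encode_i32(v: i32) -> [u8; 4] {
    let mut bytes = v.to_be_bytes();
    bytes[0] ^= 0x80;
    bytes
}

/// Encode (days, milliseconds) as 8 bytes: days first, then milliseconds,
/// so that days dominate the comparison.
pub fn encode_day_time(days: i32, milliseconds: i32) -> [u8; 8] {
    let mut out = [0u8; 8];
    out[..4].copy_from_slice(&encode_i32(days));
    out[4..].copy_from_slice(&encode_i32(milliseconds));
    out
}
```

With this scheme, comparing two encoded byte arrays gives the same answer as comparing days first and milliseconds second.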

@ozankabak

This LGTM, but @berkaysynnada will have a better idea of the work needed on the DataFusion side

@tustvold tustvold merged commit cf59b6c into apache:master May 20, 2024
27 checks passed
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Jun 13, 2024
andygrove pushed a commit to apache/datafusion-python that referenced this pull request Jun 14, 2024
* deps: update datafusion to 39.0.0, pyo3 to 0.21, and object_store to 0.10.1

`datafusion-common` also depends on `pyo3`, so they need to be upgraded together.

* feat: remove GetIndexField

datafusion replaced Expr::GetIndexField with a FieldAccessor trait.

Ref apache/datafusion#10568
Ref apache/datafusion#10769

* feat: update ScalarFunction

The field `func_name` was changed to `func` as part of removing `ScalarFunctionDefinition` upstream.

Ref apache/datafusion#10325

* feat: incorporate upstream array_slice fixes

Fixes #670

* update ExecutionPlan::children impl for DatasetExec

Ref apache/datafusion#10543

* update value_interval_daytime

Ref apache/arrow-rs#5769

* update regexp_replace and regexp_match

Fixes #677

* add gil-refs feature to pyo3

This silences pyo3's deprecation warnings for its new Bounds api.

It's the 1st step of the migration, and should be removed before merge.

Ref https://pyo3.rs/v0.21.0/migration#from-020-to-021

* fix signature for octet_length

Ref apache/datafusion#10726

* update signature for covar_samp

AggregateUDF expressions now have a builder API design, which removes arguments like filter and order_by

Ref apache/datafusion#10545
Ref apache/datafusion#10492

* convert covar_pop to expr_fn api

Ref: https://github.com/apache/datafusion/pull/10418/files

* convert median to expr_fn api

Ref apache/datafusion#10644

* convert variance sample to UDF

Ref apache/datafusion#10667

* convert first_value and last_value to UDFs

Ref apache/datafusion#10648

* checkpointing with a few todos to fix remaining compile errors

* impl PyExpr::python_value for IntervalDayTime and IntervalMonthDayNano

* convert sum aggregate function to UDF

* remove unnecessary clone on double reference

* apply cargo fmt

* remove duplicate allow-dead-code annotation

* update tpch examples for new pyarrow interval

Fixes #665

* marked q11 tpch example as expected fail

Ref #730

* add default stride of None back to array_slice
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rust Interval definition incorrect
Structured Interval Native Type
4 participants