Additional operator for `update_by` requested #5709

lbooker42 · 2024-07-03T00:36:53Z

To support production use cases, we need the following operators (also found in #4424):

median
percentile
var
cor
count_neg, count_pos
cum_std

But also needed are the following (supported by pandas / Polars):

last
rank
pct_change

chipkent · 2024-08-14T20:28:17Z

Other count operations like count_null, count_nan, etc. would be useful.

chipkent · 2024-08-19T14:39:47Z

As in other cases, null values should be ignored and NaN values should be included, which typically poisons the result.

chipkent · 2024-08-19T14:48:33Z

I looked through the pandas docs and found a few more operations that we should really support:

kurt / kurtosis
skew

chipkent · 2024-09-09T21:13:32Z

Below is an attempt at a more comprehensive and carefully curated list.

As has been the case for other operations:

null values are ignored in calculations.
NaN values are included in calculations. Typically, this means that NaN poisons results, so the operator will return NaN after seeing a NaN.
+0.0 and -0.0 are treated as equal.
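
To make the three rules above concrete, here is a minimal pure-Python sketch of a cumulative max. It is only an illustration, not the engine's implementation: `cum_max` is a hypothetical name, `None` stands in for null, and emitting null before the first non-null value is an assumption.

```python
import math
from typing import List, Optional, Sequence


def cum_max(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Running max: nulls (None) are skipped, NaN poisons every later result."""
    best: Optional[float] = None
    out: List[Optional[float]] = []
    for v in values:
        if v is None:
            pass  # null: ignored, the running result is carried forward
        elif best is None or math.isnan(v):
            best = v  # first real value, or a NaN that poisons the result
        elif not math.isnan(best) and v > best:
            best = v  # strict '>' means +0.0 does not replace -0.0 (they are equal)
        out.append(best)
    return out


demo = cum_max([None, -0.0, 0.0, 2.0, None, math.nan, 5.0])
# Stays null until the first value, carries 2.0 across the null,
# and remains NaN once a NaN has been seen.
```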

Operators have a few different contexts:

agg
update_by cumulative
update_by window / rolling
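
As a rough illustration of how one operator differs across these three contexts, the sketch below applies a sum in each of them. The function names are hypothetical, the rolling window is assumed to be a trailing count of rows, and the quadratic recomputation is for readability only.

```python
from typing import List, Optional, Sequence


def agg_sum(values: Sequence[Optional[float]]) -> Optional[float]:
    """agg context: one result for the whole group, nulls (None) ignored."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


def cumulative_sum(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Cumulative context: one result per row, over all rows seen so far."""
    return [agg_sum(values[: i + 1]) for i in range(len(values))]


def rolling_sum(values: Sequence[Optional[float]], window: int) -> List[Optional[float]]:
    """Rolling context: one result per row, over the trailing `window` rows."""
    return [agg_sum(values[max(0, i - window + 1) : i + 1]) for i in range(len(values))]


data = [1.0, 2.0, 3.0, 4.0]
total = agg_sum(data)            # a single value for the group
running = cumulative_sum(data)   # same length as the input
windowed = rolling_sum(data, 2)  # same length as the input
```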

Missing cumulative operators:

New operators (singleton)

delta_pct (Naming seems more consistent with the existing delta than the originally proposed pct_change)
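
The thread does not define the semantics of delta_pct, so the following is only a sketch under assumptions: it mirrors the fractional change used by pandas `pct_change`, that is (current - previous) / previous, and it returns null (`None`) when either value is null or the previous value is zero. Both choices are illustrative and would need to be agreed on.

```python
from typing import List, Optional, Sequence


def delta_pct(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fractional change from the previous row (assumed semantics)."""
    out: List[Optional[float]] = []
    prev: Optional[float] = None
    for v in values:
        if v is None or prev is None or prev == 0:
            # Assumption: no defined change without two usable values.
            out.append(None)
        else:
            out.append((v - prev) / prev)
        prev = v
    return out


changes = delta_pct([100.0, 110.0, 99.0])
# The first row has no previous value, so its result is null.
```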

New operators (agg, cumulative, and rolling):

Don't Do Operators? (Present in agg)

These are present in agg, but they may not be worth adding to the other cases until there is demand. They need some discussion.

[] distinct
[] unique
[] sorted_first
[] sorted_last

(?) Whether this method should be implemented is open to debate because of efficiency concerns.
(*) May involve some tricky, careful numerics to compute good values. Need to be careful in defining the calculation.
(+) Not yet implemented in Numerics.ftl
(!) There has been some discussion of these operations with @rcaudy and @chipkent. cum_first/cum_last are the same as first_by/last_by, so there is an argument for not including them. offset is proposed as a way to get a value at a specific index or time offset, instead of having first/last operators. For time offsets, there needs to be a way to disambiguate when multiple values share the same time offset. offset would not be supported by agg, but first and last would.

chipkent · 2024-09-13T20:15:03Z

Details on computing skewness and excess kurtosis can be found at:

We want the sample skewness and sample excess kurtosis. The formulae used by Excel, SAS, etc. have probably been well vetted.
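
For reference, a minimal sketch of the bias-adjusted sample formulas as documented for Excel's SKEW and KURT functions is shown below. The function names are hypothetical, nulls (`None`) are dropped, and returning null for too few values or zero variance is an assumption. The two-pass approach is chosen for clarity, not for numerical robustness or incremental updates.

```python
import math
from typing import List, Optional, Sequence


def _non_null(values: Sequence[Optional[float]]) -> List[float]:
    return [v for v in values if v is not None]


def sample_skew(values: Sequence[Optional[float]]) -> Optional[float]:
    """n / ((n-1)(n-2)) * sum(((x - mean) / s)^3), with s the sample std dev."""
    xs = _non_null(values)
    n = len(xs)
    if n < 3:
        return None  # assumption: undefined for fewer than 3 values
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    if s == 0:
        return None  # assumption: undefined for constant data
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)


def sample_excess_kurtosis(values: Sequence[Optional[float]]) -> Optional[float]:
    """n(n+1)/((n-1)(n-2)(n-3)) * sum(((x - mean)/s)^4) - 3(n-1)^2/((n-2)(n-3))."""
    xs = _non_null(values)
    n = len(xs)
    if n < 4:
        return None  # assumption: undefined for fewer than 4 values
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    if s == 0:
        return None
    scaled = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    fourth = sum(((x - mean) / s) ** 4 for x in xs)
    return scaled * fourth - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
```

A NaN input propagates through the mean and standard deviation, so both functions return NaN in that case, consistent with the poisoning rule described earlier in the thread.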

chipkent · 2024-09-13T20:16:02Z

Details on computing the sample covariance can be found at:

https://en.wikipedia.org/wiki/Covariance#Calculating_the_sample_covariance
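
A minimal sketch of the sample covariance from that page, using the n - 1 denominator, follows. The name `sample_cov` is hypothetical, and dropping a pair when either side is null (`None`) is an assumption about how nulls would be handled for two-column operators.

```python
from typing import Optional, Sequence


def sample_cov(
    xs: Sequence[Optional[float]], ys: Sequence[Optional[float]]
) -> Optional[float]:
    """Sample covariance: sum((x - mean_x) * (y - mean_y)) / (n - 1)."""
    # Assumption: a pair is ignored if either value is null.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None  # assumption: undefined for fewer than 2 pairs
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)


cov = sample_cov([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```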

lbooker42 added the feature request and triage labels Jul 3, 2024

lbooker42 self-assigned this Jul 3, 2024

rcaudy added this to the 4. Unscheduled milestone Jul 3, 2024

rcaudy added the query engine and core labels and removed the triage label Jul 3, 2024

pete-petey added the 2023_unscheduled label Aug 26, 2024

pete-petey modified the milestones: 4. Unscheduled, 5. Backlog Aug 26, 2024
