-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support date_bin
on timestamps with timezone, properly accounting for Daylight Savings Time
#10602
Comments
The way you can perform this binning in postgres is somewhat paradoxically to convert a timestamp with a timezone back to a timestamp without timezone and then apply The syntax to convert a timestamp to something without timezone is 🤯 : you apply postgres=# select pg_typeof('2024-05-21T12:00:00'::timestamp AT TIME ZONE 'America/New_York');
pg_typeof
--------------------------
timestamp with time zone
(1 row)
postgres=# select pg_typeof('2024-05-21T12:00:00'::timestamp AT TIME ZONE 'America/New_York' AT TIME ZONE 'Europe/Brussels'
);
pg_typeof
-----------------------------
timestamp without time zone. <-- WTF no timezone as the result of AT TIME ZONE
(1 row) |
If we cast using arrow_cast back to > create or replace view t_roundtrip as select arrow_cast(column1, 'Timestamp(Nanosecond, None)') as "column1" from t;
0 row(s) fetched.
Elapsed 0.002 seconds.
> select * from t_roundtrip;
+---------------------+
| column1 |
+---------------------+
| 2024-01-01T00:00:01 | <--- timestamps now have no timezone
| 2024-02-01T00:00:01 |
| 2024-03-01T00:00:01 |
| 2024-04-01T00:00:01 |
| 2024-05-01T00:00:01 |
| 2024-06-01T00:00:01 |
| 2024-07-01T00:00:01 |
| 2024-08-01T00:00:01 |
| 2024-09-01T00:00:01 |
| 2024-10-01T00:00:01 |
| 2024-11-01T00:00:01 |
| 2024-12-01T00:00:01 |
+---------------------+
12 row(s) fetched.
Elapsed 0.001 seconds.
> select arrow_typeof(column1) as arrow_type, column1, date_bin(interval '1 day', column1) as "date_bin" from t_roundtrip;
+-----------------------------+---------------------+---------------------+
| arrow_type | column1 | date_bin |
+-----------------------------+---------------------+---------------------+
| Timestamp(Nanosecond, None) | 2024-01-01T00:00:01 | 2024-01-01T00:00:00 | <-- bins are as desired
| Timestamp(Nanosecond, None) | 2024-02-01T00:00:01 | 2024-02-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-03-01T00:00:01 | 2024-03-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-04-01T00:00:01 | 2024-04-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-05-01T00:00:01 | 2024-05-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-06-01T00:00:01 | 2024-06-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-07-01T00:00:01 | 2024-07-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-08-01T00:00:01 | 2024-08-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-09-01T00:00:01 | 2024-09-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-10-01T00:00:01 | 2024-10-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-11-01T00:00:01 | 2024-11-01T00:00:00 |
| Timestamp(Nanosecond, None) | 2024-12-01T00:00:01 | 2024-12-01T00:00:00 |
+-----------------------------+---------------------+---------------------+
12 row(s) fetched.
Elapsed 0.003 seconds. |
Given the statement in the description, here is the best I can come up with using -- Times in brussels
WITH t_brussels
AS (
SELECT
column1 AT TIME ZONE 'Europe/Brussels' as ts -- timestamp in specified timezone
FROM t_utc
)
SELECT
ts as "time in Brussels",
date_bin(
interval '1 day',
arrow_cast(ts, 'Timestamp(Nanosecond, None)'),
'2020-01-01T00:00:00Z'::timestamp
) as date_bin
FROM
t_brussels;
+---------------------------+---------------------+
| time in Brussels | date_bin |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T00:00:00 |
| 2024-05-01T00:30:00+02:00 | 2024-04-30T00:00:00 | <-- this is in the wrong bin
| 2024-05-01T01:30:00+02:00 | 2024-04-30T00:00:00 | <-- this in in the wrong bin
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T00:00:00 |
+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.004 seconds. |
@mhilton and I agree that if we had the functionality suggested by @Abdullahsab3's on #10368 (comment)
We think we could get the right value out of For example, If we had a function like select remove_timezone("time in Brussels")
+---------------------------+---------------------+
| time in Brussels | remove_timezone |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T23:30:00 |
| 2024-05-01T00:30:00+02:00 | 2024-05-01T00:30:00 | <-- timezone removed (no +02:00, but value not adjusted)
| 2024-05-01T01:30:00+02:00 | 2024-05-01T01:30:00 |
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T02:30:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T12:30:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T22:30:00 |
+---------------------------+---------------------+ Note that this is different than -- Times in brussels
WITH t_brussels
AS (
SELECT
column1 AT TIME ZONE 'Europe/Brussels' as ts -- timestamp in specified timezone
FROM t_utc
)
SELECT
ts as "time in Brussels",
arrow_cast(ts, 'Timestamp(Nanosecond, None)') as arrow_cast
FROM
t_brussels;
+---------------------------+---------------------+
| time in Brussels | arrow_cast |
+---------------------------+---------------------+
| 2024-04-30T23:30:00+02:00 | 2024-04-30T21:30:00 |
| 2024-05-01T00:30:00+02:00 | 2024-04-30T22:30:00 | <-- note this is now in 2024-04-30 not 2024-05-01
| 2024-05-01T01:30:00+02:00 | 2024-04-30T23:30:00 |
| 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T02:30:00+02:00 | 2024-05-01T00:30:00 |
| 2024-05-01T12:30:00+02:00 | 2024-05-01T10:30:00 |
| 2024-05-01T22:30:00+02:00 | 2024-05-01T20:30:00 |
+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.003 seconds. |
My suggested next steps for this ticket:
|
I'd like to work on this issue 🙋♀️ |
The challenge with an approach that relies on stripping the timezone, is you end up back with the ambiguous timestamp issue. The UX of relying on users to strip the timezone is also rather unfortunate. I'm curious as to why we wouldn't want to just make date_bin timezone aware? This would appear to have a lot less potential edge cases |
I believe @mhilton had thought about this and maybe has some thoughts |
I'm not sure that it's obvious how a time-zone aware date_bin would behave. For example On the day in october when many time zones repeat an hour would the two hours count into a single bin? What about if you had hour long bins that start 30 minutes into the hour? Given that the two hours are easily distinguishable the answer is probably not, and if you really want that behviour then stripping time zone information would be a reasonable way to get it. If we restricted timezone awareness to intervals of unit days then that would be much more reasonable. We could make it so that date_bin understands the number of hours in any given day in a time zone and bins a time accordingly. Anyone needing a strict 24 hour binning could specify the interval in hour units. |
This would be the intuitive thing for it to do IMO, if people don't want this, they should cast to a timezone that doesn't have DST |
To make
|
I want to highlight this open-ended question and hear how others think |
Thanks for filing the ticket and for all the detailed explanations! very enriching I wonder whether the Postgres behavior is actually that bad. Though it looks weird, it still is generic enough to make it widely applicable. The issue with making The postgres way of converting UTC to local time of a given timezone is: select '2024-05-21T12:00:00Z'::timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels';
timezone
--
2024-05-21 14:00:00 As already mentioned by Andrew, the |
Thank you @tustvold and @Abdullahsab3 and @mhilton and @appletreeisyellow for the thoughts. From my perspective, the current (non timezone aware) Given the differences in underlying timestamp representation between arrow and postgres I do think some difference is likely inevitable and thus likely not a deal breaker. Here are my suggested next steps @appletreeisyellow tries to prototype one or both proposals and see if we can get it to produce the desired results:
I think the
|
I was looking into this issue again. I am still validating this, but I found that using |
@Abdullahsab3 Nice find! Seems like using I had been prototyping what @alamb suggested above since yesterday. I was able to do step 1. Note that the |
I am able to verified that select
arrow_typeof(column1) as arrow_type,
column1,
date_bin(interval '1 day', to_char(column1 AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '%F %X')) as "date_bin"
from t_timezone;
+------------------------------------------------+---------------------------+---------------------+
| arrow_type | column1 | date_bin |
+------------------------------------------------+---------------------------+---------------------+
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-01-01T00:00:01+01:00 | 2024-01-01T00:00:00 | <-- binned correctly
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-02-01T00:00:01+01:00 | 2024-02-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-03-01T00:00:01+01:00 | 2024-03-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-04-01T00:00:01+02:00 | 2024-04-01T00:00:00 | <-- binned correctly on daylight savings time
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-05-01T00:00:01+02:00 | 2024-05-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-06-01T00:00:01+02:00 | 2024-06-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-07-01T00:00:01+02:00 | 2024-07-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-08-01T00:00:01+02:00 | 2024-08-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-09-01T00:00:01+02:00 | 2024-09-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-10-01T00:00:01+02:00 | 2024-10-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-11-01T00:00:01+01:00 | 2024-11-01T00:00:00 |
| Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-12-01T00:00:01+01:00 | 2024-12-01T00:00:00 |
+------------------------------------------------+---------------------------+---------------------+
12 row(s) fetched.
Elapsed 0.024 seconds. select
column1 as "time in UTC",
column1 AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels' as "time in Brussels",
date_bin(interval '1 day', to_char(column1 AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '%F %X')) as "date_bin"
from t_utc;
+----------------------+---------------------------+---------------------+
| time in UTC | time in Brussels | date_bin |
+----------------------+---------------------------+---------------------+
| 2024-04-30T21:30:00Z | 2024-04-30T23:30:00+02:00 | 2024-04-30T00:00:00 |
| 2024-04-30T22:30:00Z | 2024-05-01T00:30:00+02:00 | 2024-05-01T00:00:00 | <-- binned correctly
| 2024-04-30T23:30:00Z | 2024-05-01T01:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T00:00:00Z | 2024-05-01T02:00:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T00:30:00Z | 2024-05-01T02:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T10:30:00Z | 2024-05-01T12:30:00+02:00 | 2024-05-01T00:00:00 |
| 2024-05-01T20:30:00Z | 2024-05-01T22:30:00+02:00 | 2024-05-01T00:00:00 |
+----------------------+---------------------------+---------------------+
7 row(s) fetched.
Elapsed 0.023 seconds. How to create the dataThe following code were given by Andrew from here. Copy here again for easier review. -- Dates right around the first of May
create or replace table t AS
VALUES
('2024-04-30T21:30:00'),
('2024-04-30T22:30:00'),
('2024-04-30T23:30:00'),
('2024-05-01T00:00:00'),
('2024-05-01T00:30:00'),
('2024-05-01T10:30:00'),
('2024-05-01T20:30:00')
;
create or replace view t_utc as
select column1::timestamp AT TIME ZONE 'UTC' as "column1"
from t; create or replace table t AS
VALUES
('2024-01-01T00:00:01'),
('2024-02-01T00:00:01'),
('2024-03-01T00:00:01'),
('2024-04-01T00:00:01'),
('2024-05-01T00:00:01'),
('2024-06-01T00:00:01'),
('2024-07-01T00:00:01'),
('2024-08-01T00:00:01'),
('2024-09-01T00:00:01'),
('2024-10-01T00:00:01'),
('2024-11-01T00:00:01'),
('2024-12-01T00:00:01')
;
create or replace view t_timezone
as
select column1::timestamp AT TIME ZONE 'Europe/Brussels' as "column1"
from t; |
Very nice! Thanks for verifying @appletreeisyellow. I was able to do some minimal validations today on our systems and it seems indeed to be binning correctly. Happy to see this as an alternative to what I was doing as a fix for this issue My main concerns with the use of
|
@appletreeisyellow and I just had a chat about what to do about this ticket. It is great that we seem to have found a workaround there are at least two potential issues we see with the workaround:
We think it would be great to improve DataFusion so future users (both of InfluxDB and other systems) who use Ideal BehaviorWe think the ideal user experience would be as @tustvold suggests for While this might be different than what postgres does, the underlying treatment if timestamps is already different, so I don't think we need to be quirk for quirk compatible. ProposalThus we propose @appletreeisyellow will try and complete her prototype work to make It is likely we can't devote a large amount of time to this project, so if the implementation is taking too long we may have to abandon this effort or go with a less optimal route (e.g. a function to strip off timezones like Postgres) |
I filed apache/arrow-rs#5826 to document more precisely what is going on with casting |
I wrote up a high-level doc on how to implement the |
Another thing that needs to be taken into consideration: I think the returned timestamp from ❯ select to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X')::timestamp;
+----------------------------------------------------+
| to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X")) |
+----------------------------------------------------+
| 2024-04-30T04:00:00 |
+----------------------------------------------------+
1 row in set. Query took 0.002 seconds.
❯ select arrow_typeof(to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X')::timestamp);
+------------------------------------------------------------------+
| arrow_typeof(to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X"))) |
+------------------------------------------------------------------+
| Timestamp(Nanosecond, None) |
+------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds. When we add the timezone information: ❯ select to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X') at time zone 'Europe/Brussels';
+----------------------------------------------------+
| to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X")) |
+----------------------------------------------------+
| 2024-04-30T04:00:00+02:00 |
+----------------------------------------------------+
1 row in set. Query took 0.001 seconds.
❯ select arrow_typeof(to_char('2024-04-30T02:00:00' at time zone 'UTC' at time zone 'Europe/Brussels', '%F %X') at time zone 'Europe/Brussels');
+------------------------------------------------------------------+
| arrow_typeof(to_char(Utf8("2024-04-30T02:00:00"),Utf8("%F %X"))) |
+------------------------------------------------------------------+
| Timestamp(Nanosecond, Some("Europe/Brussels")) |
+------------------------------------------------------------------+
1 row in set. Query took 0.001 seconds. The timezone information there is important, as the timestamp will be otherwise interpreted by external integrations (such as Grafana) as being in UTC. Note that applying TLDR; if you're using the |
Thanks for highlighting it. Yes, this is the expected behavior. The timezone representation of the returned timestamp will be the same as the input timestamp. For example:
select arrow_typeof(date_bin(interval '1 day', '2024-03-30T00:00:20Z'));
+---------------------------------------------------------------------------------------------------+
| arrow_typeof(date_bin(IntervalMonthDayNano("18446744073709551616"),Utf8("2024-03-30T00:00:20Z"))) |
+---------------------------------------------------------------------------------------------------+
| Timestamp(Nanosecond, None) |
+---------------------------------------------------------------------------------------------------+
select arrow_typeof(date_bin(interval '1 day', '2024-03-30T00:00:20'));
+--------------------------------------------------------------------------------------------------+
| arrow_typeof(date_bin(IntervalMonthDayNano("18446744073709551616"),Utf8("2024-03-30T00:00:20"))) |
+--------------------------------------------------------------------------------------------------+
| Timestamp(Nanosecond, None) |
+--------------------------------------------------------------------------------------------------+
select arrow_typeof(date_bin(interval '1 day', '2024-03-30T00:00:20Z' AT TIME ZONE 'UTC'));
+---------------------------------------------------------------------------------------------------+
| arrow_typeof(date_bin(IntervalMonthDayNano("18446744073709551616"),Utf8("2024-03-30T00:00:20Z"))) |
+---------------------------------------------------------------------------------------------------+
| Timestamp(Nanosecond, Some("UTC")) |
+---------------------------------------------------------------------------------------------------+
select arrow_typeof(date_bin(interval '1 day', '2024-03-30T00:00:20Z' AT TIME ZONE 'Europe/Brussels'));
+---------------------------------------------------------------------------------------------------+
| arrow_typeof(date_bin(IntervalMonthDayNano("18446744073709551616"),Utf8("2024-03-30T00:00:20Z"))) |
+---------------------------------------------------------------------------------------------------+
| Timestamp(Nanosecond, Some("Europe/Brussels")) |
+---------------------------------------------------------------------------------------------------+ |
I probably missed some prior discussion. Do we want |
I think we already have date_trunc
I think the difference between However it is a good question if |
Update: @appletreeisyellow merged an workaround -- She has no further current concrete plans to work on this ticket at this time |
FYI - I came across the Jiff crate the other day that looks to have some nice support for tz aware calendar handling. From the benchmarks I've run it's not quite as quick as chrono for some things and about equivalent to others. I would recommend anyone who wants to tackle this issue to have a close look at it to see if it solves some of the intricacies around tz conversion and DST arithmetic and rounding. |
Author of Jiff here. I don't have a ton of context on the specific problem in this issue, but I'd be happy to field questions about DST safe arithmetic. Jiff in particular is heavily inspired by Temporal, and Temporal (and Jiff) both intend to match RFC 5545 for duration handling. (It's still early days and I think there is a fair bit of room for improvement for Jiff in terms of perf. Both in micro-optimizing its existing code and in adding new APIs to facilitate faster ops when the use case demands it.) |
Is your feature request related to a problem or challenge?
Summary
Given this data in UTC timestamps
How to create the data
I would like to be able to get this output (bin the data by their day in Brussels, properly accounting for daylight savings time).
We can't figure out how to do this at the moment
Details
Broken out from @Abdullahsab3's great ticket #10368
We would like to apply date_bin and have it bin the dates in local time, properly including daylight savings time calculations
For example, given the following data at 12:01AM in the morning on the first day of each month,I would like to bin it into the first day of each month.
This works as expected when there is no timestamp (all timestamps end up in the correct day)
However, once the timestamp is in a timezone it has problems:
The timestamps now binned at incorrectly (the are binned based on UTC midnight, not midnight in Brussels) so for example a time in
2024-01-01
goes into a2023-12-31
bucket:The third argument to
date_bin
allows changing the starting offset for a day, so let's set it to midnight in Europe/Brussels time. This almost works but does not take into account daylight savings time so the timestamps are binned incorrectly during DST:Describe the solution you'd like
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: