-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36954: [Python] Add more FlightInfo / FlightEndpoint attributes #43537
GH-36954: [Python] Add more FlightInfo / FlightEndpoint attributes #43537
Conversation
|
python/pyarrow/_flight.pyx
Outdated
f"locations={self.locations!r}>") | ||
f"locations={self.locations!r} " | ||
f"expiration_time={self.expiration_time} " | ||
f"app_metadata='{self.app_metadata.hex()}'>") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This matches the C++ implementation of ToString
. In Python, we could easily use self.app_metadata
, which represents the metadata bytes in ASCII with automatic escaping of non-printable characters, e.g. b'abc\xe2\x80\xa6'
, which is more readable in the absence of non-printable characters than .hex()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we don't need to use the same style as C++ in Python.
So we can remove .hex()
.
@jorisvandenbossche What do you think about this?
python/pyarrow/_flight.pyx
Outdated
total_records : int optional, default None | ||
the total records in this flight, or -1 if unknown. | ||
total_bytes : int optional, default None | ||
the total bytes in this flight, or -1 if unknown. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While adding new attributes ordered
and app_metadata
as optional arguments, which simplifies user and test code, making existing total_records
and total_bytes
optional sounds like a good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using -1
not None
as the default value because the document says -1 if unknown
not None if unknown
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will use -1
though it is less pythonic, as it is more consistent with C++.
python/pyarrow/_flight.pyx
Outdated
@@ -950,7 +1005,9 @@ cdef class FlightInfo(_Weakrefable): | |||
f"descriptor={self.descriptor} " | |||
f"endpoints={self.endpoints} " | |||
f"total_records={self.total_records} " | |||
f"total_bytes={self.total_bytes}>") | |||
f"total_bytes={self.total_bytes} " | |||
f"ordered={'true' if self.ordered else 'false'} " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This matches the C++ implementation of ToString
. In Python, you would expect True
or False
though. Happy to change this to
f"ordered={'true' if self.ordered else 'false'} " | |
f"ordered={self.ordered} " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we can use True
/False
in Python.
python/pyarrow/tests/test_flight.py
Outdated
"locations=[]>") | ||
"locations=[] " | ||
"expiration_time=2023-04-05 12:34:56+00:00 " | ||
"app_metadata='656e64706f696e7420617070206d65746164617461'>") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could read (which is more readable)
app_metadata=b'endpoint app metadata'
if we wouldn't stick to the C++ implementation of ToString
here.
python/pyarrow/tests/test_flight.py
Outdated
"total_bytes=-1>") | ||
"total_records=1 " | ||
"total_bytes=42 " | ||
"ordered=true " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Python, you would expect
ordered=True
here.
python/pyarrow/_flight.pyx
Outdated
f"locations={self.locations!r}>") | ||
f"locations={self.locations!r} " | ||
f"expiration_time={self.expiration_time} " | ||
f"app_metadata='{self.app_metadata.hex()}'>") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we don't need to use the same style as C++ in Python.
So we can remove .hex()
.
@jorisvandenbossche What do you think about this?
python/pyarrow/_flight.pyx
Outdated
total_records : int optional, default None | ||
the total records in this flight, or -1 if unknown. | ||
total_bytes : int optional, default None | ||
the total bytes in this flight, or -1 if unknown. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using -1
not None
as the default value because the document says -1 if unknown
not None if unknown
?
python/pyarrow/_flight.pyx
Outdated
@@ -950,7 +1005,9 @@ cdef class FlightInfo(_Weakrefable): | |||
f"descriptor={self.descriptor} " | |||
f"endpoints={self.endpoints} " | |||
f"total_records={self.total_records} " | |||
f"total_bytes={self.total_bytes}>") | |||
f"total_bytes={self.total_bytes} " | |||
f"ordered={'true' if self.ordered else 'false'} " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we can use True
/False
in Python.
python/pyarrow/_flight.pyx
Outdated
@@ -736,6 +743,12 @@ cdef class FlightEndpoint(_Weakrefable): | |||
CLocation.Parse(tobytes(location)).Value(&c_location)) | |||
self.endpoint.locations.push_back(c_location) | |||
|
|||
if expiration_time is not None: | |||
self.endpoint.expiration_time = time_point( | |||
microseconds(expiration_time.cast(timestamp("us")).value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is microseconds correct?
I think that we should use nanoseconds:
arrow/cpp/src/arrow/flight/types.h
Lines 76 to 88 in 8068554
/// \brief A timestamp compatible with Protocol Buffer's | |
/// google.protobuf.Timestamp: | |
/// | |
/// https://protobuf.dev/reference/protobuf/google.protobuf/#timestamp | |
/// | |
/// > A Timestamp represents a point in time independent of any time | |
/// > zone or calendar, represented as seconds and fractions of | |
/// > seconds at nanosecond resolution in UTC Epoch time. It is | |
/// > encoded using the Proleptic Gregorian Calendar which extends the | |
/// > Gregorian calendar backwards to year one. It is encoded assuming | |
/// > all minutes are 60 seconds long, i.e. leap seconds are "smeared" | |
/// > so that no leap second table is needed for interpretation. Range | |
/// > is from 0001-01-01T00:00:00Z to 9999-12-31T23:59:59.999999999Z. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer nanoseconds as well, but that did not work on macOS:
/Users/runner/work/arrow/arrow/build/python/build/temp.macosx-10.9-universal2-cpython-311/_flight.cpp:31893:102: error: no matching conversion for functional-style cast from 'std::chrono::nanoseconds' (aka 'duration<long long, ratio<1LL, 1000000000LL>>') to 'std::chrono::system_clock::time_point' (aka 'time_point<std::chrono::system_clock>')
__pyx_v_self->endpoint.expiration_time = ((std::optional<std::chrono::system_clock::time_point> )std::chrono::system_clock::time_point(std::chrono::nanoseconds(__pyx_t_12)));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
https://github.com/EnricoMi/arrow/actions/runs/10083176212/job/27879168275#step:8:3720
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use std::chrono::duration_cast<Timestamp::duration>()
like
arrow/cpp/src/arrow/flight/flight_internals_test.cc
Lines 194 to 195 in 1992cc6
Timestamp expiration_time( | |
std::chrono::duration_cast<Timestamp::duration>(expiration_time_duration)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That did the trick, thanks for the pointer!
arrow/python/pyarrow/_flight.pyx
Lines 746 to 748 in 501d3a6
if expiration_time is not None: | |
self.endpoint.expiration_time = time_point(duration_cast[time_point.duration]( | |
nanoseconds(expiration_time.cast(timestamp("ns")).value))) |
+1 Looking forward for this being merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No additional comments from me.
0506ead
to
e5ece4b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just some minor comments
python/pyarrow/_flight.pyx
Outdated
total_records : int optional, default -1 | ||
the total records in this flight, or -1 if unknown. | ||
total_bytes : int optional, default -1 | ||
the total bytes in this flight, or -1 if unknown. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None
would be a preferrable default for a Python API, IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Me too, but the attributes will have to return -1
(because of the C++ backend), so that would be a bit inconsistent. @kou your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Python attributes can (and probably should?) return None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, that would be pythonic and consistent (with None default values). But that might be a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None isn't a real value for these. (It's also inconsistent with the underlying protobuf, which doesn't really have optional values, and would default to 0 for a truly unspecified value.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what a "real value" is? None
is the standard placeholder in Python for missing or unknown data. I agree that it would be a slight compatibility break to start returning None
instead of -1, but that's for the better IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The underlying structure does not admit a null value. It is Protobuf, there are no true null values (*this has changed slightly in recent versions but we do not use that codegen option for this structure).
Let's be clear here, though. Is this for input, or output? I would be -1 to ever returning None
for this attribute. If it's merely to accept it as an input on construction that might be OK, but it would be misleading to people when it doesn't round trip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not affect the underlying structure. The Python FlightEnpoint
can take None
values and translate to -1
for the underlying C++ class. The getter methods can translate those -1
back to None
, so this Python class feels pythonic.
Using -1
in constructor would continue to work while getters would return None
.
User code like if endpoint.total_records == -1
would break. This would have to be migrated to if endpoint.total_records is None
, which is for the better, as @pitrou phrased it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The getter methods can translate those -1 back to None, so this Python class feels pythonic.
This is what I would object to.
The Python FlightEnpoint can take None values and translate to -1 for the underlying C++ class.
I don't mind this though it should be clearly documented (as the underlying Protobuf does not take None).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To summarize, the constructor allows for -1
and None
, the properties in both cases return -1
. Added test to highlight this.
python/pyarrow/_flight.pyx
Outdated
if self.endpoint.expiration_time.has_value(): | ||
time_since_epoch = duration_cast[nanoseconds]( | ||
self.endpoint.expiration_time.value().time_since_epoch()).count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some existing helpers defined for time point to nanoseconds conversion:
inline int64_t TimePoint_to_ns(TimePoint val) { return val.time_since_epoch().count(); } |
(and similar for the other way around, TimePoint_from_ns
)
That could maybe be used here? (didn't check in detail if it applies here, but they are already declared in libarrow_python.pxd, and then the new chrono.pxd might not be needed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checking, thanks for the pointer!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simplified constructor and def expiration_time
, could removed chrono.pxd
with that
This reverts commit 84bbe5f.
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
0b49a09
to
94a40b2
Compare
@lidavidm thanks! |
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit b193c4f. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Fixing the |
…ound from #43537 (#44681) ### What changes are included in this PR? Make `arrow.flight.Timestamp` nanoseconds precision and OS-independent. Use the `CTimePoint` for `FlightEndpoint.expiration_time` to have OS-independent nanoseconds precision. https://github.com/apache/arrow/blob/093655c60783321c786a8e69632f185d37520f4d/python/pyarrow/includes/libarrow_flight.pxd#L133-L139 This reverts workaround for `expiration_time` from #43537. ### Are these changes tested? Existing unit tests are adjusted to the new precision. ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** The values of `FlightEndpoint.expiration_time` now have nanoseconds precision on any operating system. * GitHub Issue: #44679 Authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: David Li <li.davidm96@gmail.com>
Rationale for this change
The C++ classes
FlightInfo
andFlightEndpoint
have attributes that are not available via the Python API.What changes are included in this PR?
Make the following attributes available in Python:
FlightInfo.ordered
FlightInfo.app_metadata
FlightEndpoint.expiration_time
FlightEndpoint.app_metadata
Also makes existing attributes optional in constructor:
FlightInfo.total_records
FlightInfo.total_bytes
Are these changes tested?
Existing tests that test existing attributes are extended.
Are there any user-facing changes?
Yes, changes are backward compatible.