Prototype implementing DataFusion functions / operators using arrow-udf library #11413
Comments
take |
Sorry, it took longer than I expected to make this work end-to-end. I plan to make a ScalarUDF with From my perspective (feel free to correct me if I'm wrong), Good points:
Bad points:
Neutral:
I'd think we could replace some string functions, that are not supported by datafusion/datafusion/physical-expr/src/expressions/binary.rs Lines 616 to 664 in f204869
An example would be:
// declare concat
#[function("concat(string, string) -> string")]
#[function("concat(largestring, largestring) -> largestring")]
fn concat(lhs: &str, rhs: &str) -> String {
format!("{}{}", lhs, rhs)
}
// reference concat
apply_udf(
&ColumnarValue::Array(left),
&ColumnarValue::Array(right),
&Field::new("", DataType::Utf8, true),
"concat",
)
CC @alamb |
Btw... the
Would it be better to use RecordBatch instead of ColumnarValue in the PhysicalExpr evaluate function? It would integrate more cleanly with the arrow-rs ecosystem. |
One thing that ColumnarValue does well is represent single values efficiently (aka I don't see any fundamental reason we couldn't use RecordBatch if we figured out a better way to represent single row |
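To illustrate why a dedicated single-value representation matters, here is a minimal sketch. The `Columnar` enum and `concat_columnar` function are hypothetical, simplified stand-ins and not DataFusion's actual `ColumnarValue` API; the null-propagating semantics are also an assumption made for the example.

```rust
/// Hypothetical, simplified stand-in for a scalar-or-array value.
#[derive(Debug, Clone, PartialEq)]
enum Columnar {
    /// A single value that logically repeats for every row.
    Scalar(Option<String>),
    /// One (nullable) value per row.
    Array(Vec<Option<String>>),
}

impl Columnar {
    /// Value at row `i`; a scalar ignores the index.
    fn get(&self, i: usize) -> Option<&str> {
        match self {
            Columnar::Scalar(v) => v.as_deref(),
            Columnar::Array(vs) => vs[i].as_deref(),
        }
    }
}

/// Concatenates two inputs, returning null when either side is null.
/// Two scalars are evaluated once instead of once per row.
fn concat_columnar(lhs: &Columnar, rhs: &Columnar, num_rows: usize) -> Columnar {
    match (lhs, rhs) {
        (Columnar::Scalar(a), Columnar::Scalar(b)) => Columnar::Scalar(match (a, b) {
            (Some(a), Some(b)) => Some(format!("{a}{b}")),
            _ => None,
        }),
        _ => Columnar::Array(
            (0..num_rows)
                .map(|i| match (lhs.get(i), rhs.get(i)) {
                    (Some(a), Some(b)) => Some(format!("{a}{b}")),
                    _ => None,
                })
                .collect(),
        ),
    }
}
```

With a RecordBatch-only interface, the scalar side would presumably have to be materialized as a full-length array before evaluation, which is the cost the scalar variant avoids.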
Thank you so much @xinlifoobar -- this is really helpful and a great analysis (I think the pros/cons you identified make a lot of sense to me) From what I can see, if we wanted to proceed with using Here are some additional discussions
I think this is part of the same concept as discussed on https://lists.apache.org/thread/x8wvlkfr0osl15o52rw85wom0p4v05x6 -- basically the arrow-udf library's scope is large enough to encompass things like a function registry that DataFusion already has
I do think being able to special case scalar value is a critical requirement for performance. I will post about your findings on the mailing lists and let's see what the authors of arrow-udf have to say |
FWIW this implementation of concat would likely perform pretty poorly compared to a hand written one as it will both create and allocate a new temporary
// declare concat
#[function("concat(string, string) -> string")]
#[function("concat(largestring, largestring) -> largestring")]
fn concat(lhs: &str, rhs: &str) -> String {
format!("{}{}", lhs, rhs)
} |
In this case, a writer-style return value is supported to avoid the overhead.
#[function("concat(string, string) -> string")]
#[function("concat(largestring, largestring) -> largestring")] // will be supported soon
fn concat(lhs: &str, rhs: &str, output: &mut impl std::fmt::Write) {
write!(output, "{}{}", lhs, rhs).unwrap();
}
The same technique can be applied to
// to be supported
#[function("concat(binary, binary) -> binary")]
#[function("concat(largebinary, largebinary) -> largebinary")]
fn concat(lhs: &[u8], rhs: &[u8], output: &mut impl std::io::Write) {
output.write_all(lhs).unwrap();
output.write_all(rhs).unwrap();
} |
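The effect of the writer style can be sketched with the standard library alone. In the sketch below, `concat_rows` is a hypothetical driver (not code generated by arrow-udf) that writes every row into one shared buffer and records end offsets, loosely mirroring how an Arrow string array stores its values.

```rust
use std::fmt::Write;

/// Scalar logic in the writer style: appends to `output` instead of
/// returning a freshly allocated String.
fn concat(lhs: &str, rhs: &str, output: &mut impl Write) {
    write!(output, "{}{}", lhs, rhs).unwrap();
}

/// Hypothetical driver: applies `concat` to each pair of rows, writing
/// all results into a single buffer and recording where each row ends.
fn concat_rows(lhs: &[&str], rhs: &[&str]) -> (String, Vec<usize>) {
    let mut values = String::new();
    let mut offsets = vec![0];
    for (l, r) in lhs.iter().zip(rhs) {
        concat(l, r, &mut values);
        offsets.push(values.len());
    }
    (values, offsets)
}
```

Because each row is appended directly to the shared buffer, no per-row temporary String is created; the only allocations come from the buffer growing.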
This PR will add an additional meta parameter `visibility` to `arrow-udf`. I might want this to be added while working on apache/datafusion#11413. Sometimes it is better to reference the symbol directly instead of using the function registry. --------- Co-authored-by: Runji Wang <wangrunji0408@163.com>
To clarify, the arrow-udf project provides several things. (I just tried to explain it here arrow-udf/arrow-udf#55) For DataFusion, we are mainly interested in the |
I'm inspired by apache/datafusion#11413. It seems they just need the function macro, not like "UDF", which surprised me a little. --------- Signed-off-by: xxchan <xxchan22f@gmail.com>
the |
I understand we're not concerned whether I don't know how does |
The A new String object will be allocated if the
#[function("concat(string, string) -> string")]
#[function("concat(largestring, largestring) -> largestring")]
fn concat(lhs: &str, rhs: &str, output: &mut impl std::fmt::Write) {
write!(output, "{}{}", lhs, rhs).unwrap();
} |
@wangrunji0408 thanks, that's valuable! I spent quite some time today benchmarking various hypothetical implementations where concatenation logic is maximally separated from columnar (scalar/array) handling. Here are a couple of conclusions
Footnotes
|
BTW We are discussing something similar here: |
Thanks for pointing this out. |
That would make sense and was the rationale on the original inclusion from @JasonLi-cn (❤️ )
I think the other thing that is important is if the function produces null on a null input (which has some term but I can't remember now). With those two properties I do think you can skip most null handling. This PR from @kazuyukitanimura in arrow may also help improve the situation for normal arrow |
Is this a null-call function? A null-call function is an SQL-invoked function that is defined to return the null value if any of its input arguments is the null value. A null-call function is an SQL-invoked function whose <null-call clause> specifies “RETURNS NULL ON NULL INPUT”.
yes. many functions are "return null on null; return non-null on non-null". for these null handling can be externalized, and the function "business logic" abstracted. |
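As a rough sketch of what externalized null handling could look like, the hypothetical helper `lift_null_call` below (not part of DataFusion or arrow-udf) wraps a function that never sees nulls and applies the "null in, null out" rule around it.

```rust
/// Hypothetical helper: lifts a null-free binary function to nullable
/// rows, returning null whenever either input is null.
fn lift_null_call<A: Copy, B: Copy, R>(
    lhs: &[Option<A>],
    rhs: &[Option<B>],
    f: impl Fn(A, B) -> R,
) -> Vec<Option<R>> {
    lhs.iter()
        .zip(rhs)
        .map(|(l, r)| match (*l, *r) {
            (Some(l), Some(r)) => Some(f(l, r)),
            _ => None,
        })
        .collect()
}

/// The "business logic" only; it is never called with a null input.
fn concat(lhs: &str, rhs: &str) -> String {
    format!("{lhs}{rhs}")
}
```

For example, `lift_null_call(&[Some("a"), None], &[Some("b"), Some("c")], concat)` concatenates the first row and yields null for the second.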
Indeed I was thinking of
Exactly! |
I think the prototype is done. Let's continue the discussion in #12635 |
Is your feature request related to a problem or challenge?
Related to the discussion on #11192 with @Xuanwo
RisingWave has a library for automatically creating vectorized implementations of functions (e.g. that operate on arrow arrays) from scalar implementations
The library is here: https://github.com/risingwavelabs/arrow-udf
A blog post describing it is here: https://risingwave.com/blog/simplifying-sql-function-implementation-with-rust-procedural-macro/
DataFusion uses macros to do something similar in binary.rs but they are pretty hard to read / understand in my opinion:
datafusion/datafusion/physical-expr/src/expressions/binary.rs
Lines 118 to 130 in 7a23ea9
One main benefit I can see to switching to https://github.com/risingwavelabs/arrow-udf is that we could then extend arrow-udf to support Dictionary and StringView and maybe other types to generate fast kernels for multiple different array layouts.
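For readers unfamiliar with the idea, here is a minimal sketch of deriving a vectorized function from a scalar one. It uses a hypothetical declarative macro named `vectorize!`, which is only an illustration and is not arrow-udf's procedural `#[function]` macro; plain slices of `Option` values stand in for Arrow arrays.

```rust
/// Hypothetical declarative macro (not part of arrow-udf): generates a
/// row-wise, null-propagating wrapper around a two-argument scalar function.
macro_rules! vectorize {
    ($vec_name:ident, $scalar:ident, ($a:ty, $b:ty) -> $ret:ty) => {
        fn $vec_name(lhs: &[Option<$a>], rhs: &[Option<$b>]) -> Vec<Option<$ret>> {
            lhs.iter()
                .zip(rhs)
                .map(|(l, r)| match (l, r) {
                    (Some(l), Some(r)) => Some($scalar(*l, *r)),
                    _ => None,
                })
                .collect()
        }
    };
}

/// Scalar implementation: greatest common divisor via Euclid's algorithm.
fn gcd(mut a: i64, mut b: i64) -> i64 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

// Generates `gcd_vectorized(&[Option<i64>], &[Option<i64>]) -> Vec<Option<i64>>`.
vectorize!(gcd_vectorized, gcd, (i64, i64) -> i64);
```

A real implementation would generate one such kernel per array layout (for example Dictionary or StringView), which is where a macro-based approach could pay off.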
Describe the solution you'd like
I think it would be great if someone could evaluate the feasibility of using the macros in https://github.com/risingwavelabs/arrow-udf to implement DataFusion's operations (and maybe eventually functions etc)
Describe alternatives you've considered
I suggest a POC that picks one or two functions (maybe string equality or regexp_match or something) and tries to use arrow-udf's function macro instead. Here is an example of how to use it: https://docs.rs/arrow-udf/0.3.0/arrow_udf/
Additional context
No response