-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add ConfigOptions to ScalarFunctionArgs #13527
base: main
Are you sure you want to change the base?
feat: Add ConfigOptions to ScalarFunctionArgs #13527
Conversation
There is a lot of file changes here but most of the important changes are in scalar_function.rs, There is a todo in expr_simplifier.rs that I would like feedback on. |
I plan to review this carefully tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Omega359 -- this is an epic plumbing exercise 🪠
The signature in ScalarFunctionArgs
is 👌 very nice
This PR seems to require config_options
to be cloned many times now. I wonder if it is possible to avoid that 🤔. I took a brief look and it seems to be somewhat challenging as SessionState allows mutable access to the underlying SessionConfig.
Maybe we could change the semantics so that SessionConfig
has a Arc<ConfigOptions>
which was cloned when it was modified (Arc::unwrap_or_clone()
style) 🤔
I also think the const evaluator does need the actual correct ConfigOptions for correctness
let physical_expr = | ||
datafusion_physical_expr::create_physical_expr(&expr, &df_schema, &props)?; | ||
let config_options = Arc::new(ConfigOptions::default()); | ||
let physical_expr = datafusion_physical_expr::create_physical_expr( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems somewhat inevitable that creating a physical expr will require the config options
However, I also think threading through the config options down through to the physical creation will (finally) permit people to pass things from the session down to function implementations (I think @cisaacson also was trying to do this in the past)
@@ -283,10 +284,16 @@ async fn prune_partitions( | |||
|
|||
// TODO: Plumb this down |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This todo may have now be complete
@@ -336,6 +337,8 @@ pub struct ScalarFunctionArgs<'a> { | |||
// The return type of the scalar function returned (from `return_type` or `return_type_from_exprs`) | |||
// when creating the physical expression from the logical expression | |||
pub return_type: &'a DataType, | |||
// The config options which can be used to lookup configuration properties | |||
pub config_options: Arc<ConfigOptions>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
Ok(e) => e, | ||
Err(err) => return ConstSimplifyResult::SimplifyRuntimeError(err, expr), | ||
}; | ||
// todo - should the config options be the actual options here or is this sufficient? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the actual configuration options are needed here. Otherwise what will happen is that any function whose behavior relies on the ConfigOptions may have different behavior on columns and constants (or other expressions that can be constant folded)
Yes, it's a bit annoying. I was tempted to see if I could switch to &'a ConfigOptions everywhere. There is at least one 'real' (vs Arc::clone) clone for every query, possibly more as I haven't checked.
Certainly possible, I can attempt that.
I was afraid of that. I was avoiding it because of the signature changes it would required just about everywhere which would cause even more headaches for those systems trying to upgrade. |
Yeah, it is a tricky one for sure |
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look |
…rgs_session_config
@alamb I did a quick attempt at implementing that however it breaks a commonly used method - SessionConfig.options_mut(). Not having that available breaks a bunch of stuff and while switching to SessionConfig.set(..) is quite possible it's not as clean. Trying with &ConfigOptions in ScalarFunctionExpr leads to lifetime hell in areas I have no idea how to overcome right now. As much as I want this feature I'm going to put it aside for now |
…config # Conflicts: # datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs # datafusion/proto/src/physical_plan/from_proto.rs
…Expr. This means no cloning of ConfigOptions except for ScalarFunctions.
I think this may be ready for review again. For this round I refactored the code to use &ConfigOptions everywhere except for ScalarFunctionExpr so the cost for cloning ConfigOptions is only incurred when creating a scalar UDF. |
…_args_session_config # Conflicts: # datafusion/physical-expr/src/equivalence/properties.rs
Examples check failure is transient I believe. |
I restarted the checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Omega359 -- I think this PR is a major step forward and we could merge it as is.
However, I feel strongly there are two things that should be improved soon (if not this PR then follow on ones). I left specific comments about each
- Don't copy
ConfigOptions
in every call toScalarFunction::create_physical_expr
- Add convenience methods to get default
&ConfigOptions
andArc<ConfigOptions>
which I think will help people upgrading to the next version of DataFusion quickly migrate their code
let result_exec_plan: Arc<dyn ExecutionPlan> = proto | ||
.try_into_physical_plan(&ctx, runtime.deref(), &composed_codec) | ||
.try_into_physical_plan(&ctx, config_options, runtime.deref(), &composed_codec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an API change, but I think it is required to thread the config options through as each argument is specific
We could potentially improve the try_into_physical_plan
API (as a follow on PR) to make it easier to update the API in the future using a trait or something like
So this would look something like
pub trait ProtobufContext {
/// return a function registry
fn function_registry(&self) -> &dyn FunctionRegistry;
/// return the runtime env
fn runtime_env(&self) -> &RuntimeEnv;
/// return the config options
fn config_options(&self) -> &ConfigOptions;
/// return extension codec
fn extension_codec(&self) -> &dyn PhysicalExtensionCodec;
}
impl AsExecutionPlan for protobuf::PhysicalPlanNode {
...
fn try_into_physical_plan(
&self,
registry: &dyn FunctionRegistry,
config_options: &ConfigOptions,
runtime: &RuntimeEnv,
extension_codec: &dyn PhysicalExtensionCodec,
) -> Result<Arc<dyn ExecutionPlan>> {
@@ -187,7 +188,9 @@ impl PruningStatistics for MyCatalog { | |||
fn create_pruning_predicate(expr: Expr, schema: &SchemaRef) -> PruningPredicate { | |||
let df_schema = DFSchema::try_from(schema.as_ref().clone()).unwrap(); | |||
let props = ExecutionProps::new(); | |||
let physical_expr = create_physical_expr(&expr, &df_schema, &props).unwrap(); | |||
let config_options = ConfigOptions::default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something we could potentially do to make this API slightly easier to use might be to create a static default ConfigOptions
Something like
impl ConfigOptions {
/// returns a reference to default ConfigOptions
pub fn default_singleton() -> &'static ConfigOptions
}
This would then make it easier to return &ConfigOptions
in various places when only the default was needed
For example, then in LocalCsvTableFunc
you could avoid having to thread the ConfigOptions
through as in that example having the actual config options isn't important
@@ -241,6 +241,7 @@ async fn prune_partitions( | |||
partitions: Vec<Partition>, | |||
filters: &[Expr], | |||
partition_cols: &[(String, DataType)], | |||
ctx: &SessionState, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend passing &ConfigOptions
directly here as the function only needs the config, not the entire session state
@@ -243,6 +292,7 @@ pub fn create_physical_expr( | |||
Arc::new(fun.clone()), | |||
input_phy_exprs.to_vec(), | |||
return_type, | |||
Arc::new(config_options.clone()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line worries me -- it means each distinct scalar function in the plan will get an entirely new copy of the ConfigOptions.
I think the Arc
should be passed in as the argument like this (I realize this will be a significant code change) so that the config options are copied at most once per plan
/// Create a physical expression for the UDF.
pub fn create_physical_expr(
fun: &ScalarUDF,
input_phy_exprs: &[Arc<dyn PhysicalExpr>],
input_schema: &Schema,
args: &[Expr],
input_dfschema: &DFSchema,
config_options: &Arc<ConfigOptions>, // <--- I think this should be an `Arc`
) -> Result<Arc<dyn PhysicalExpr>> {
I think that would also make it clearer that physical planning makes a read only copy of the configuration (Arc<ConfigOptions>
) that is then unchanged during execution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my main concern as well. I am unsure that this change should merged in as is to be honest as any fix will be just as disruptive as this PR is api wise.
Pushing the clone higher up the stack is possible but I did run into major issues trying to push it all the way up including all kinds of disruptive changes like changing OptimizerConfig from &ConfigOptions to the Arc version. Doing the opposite - pushing &ConfigOptions all the way down ran into issues with ScalarFunctionExpr and DynEq/DynHash.
Speaking for myself I prefer the latter as I feel it's less disruptive overall ... I just couldn't get it to work before. Changing the signature of DynHash and DynEq traits may allow it to work but I haven't tried that yet. Maybe with some more thought in January I can get it to work but I expect my January time to be pretty limited.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushing the clone higher up the stack is possible but I did run into major issues trying to push it all the way up including all kinds of disruptive changes like changing OptimizerConfig from &ConfigOptions to the Arc version. Doing the opposite - pushing &ConfigOptions all the way down ran into issues with ScalarFunctionExpr and DynEq/DynHash.
If you have an Arc<ConfigOptions>
you can always get a &ConfigOptions
by calling options.as_ref()
So in other words, I don't know if you need to push the Arc through everywhere (at least at first) -- we could just change the create physical expr. I would love to help with this -- perhaps I can find time later in the week (though I have a few other things going on too)
Another option to avoid cloning ConfigOptions
for each instance of ScalarFunction
might be to add some sort of API / way for the function to communicate "I need the config options"
Which issue does this PR close?
Closes #13519
Rationale for this change
Allow udf's to access df config
What changes are included in this PR?
Code.
Are these changes tested?
Existing tests.
Are there any user-facing changes?
Not specifically, this is covered with the udf signature change in #13290