feat: add AsyncCatalogProvider helpers for asynchronous catalogs
#13800
base: main
Conversation
What's the cache-then-plan approach? (The linked page doesn't mention "cache".)
datafusion/catalog/src/async.rs (outdated)

/// A schema provider that looks up tables in a cache
///
/// This is created by the [`AsyncSchemaProvider::resolve`] method
Does that mean the code is auto-generated?
No. I have changed the comment to "Instances are created by...". Is this more clear?
Err(DataFusionError::Execution(format!("Attempt to deregister table '{name}' with ResolvedSchemaProvider which is not supported")))
}

fn table_exist(&self, name: &str) -> bool {
Suggested change:
- fn table_exist(&self, name: &str) -> bool {
+ fn table_exists(&self, name: &str) -> bool {
This method name is defined by the `SchemaProvider` trait. Renaming it would be a breaking change and I don't think it is justified.
datafusion/catalog/src/async.rs (outdated)
let Some(schema) = schema else { continue };

if !schema.cached_tables.contains_key(reference.table()) {
    let resolved_table =
Could this part be factored out into a separate helper method?
}

#[tokio::test]
async fn test_defaults() {
I'm wondering: if we use cached tables, should we have tests for that? I mean the cached tables should reflect the most recent catalog state; if a table is added/modified/dropped, that should be reflected in the cache.
Discussed below
Perhaps I should avoid using the word cache. This is not a long lived multi-query cache. This is a single query cache meant to be thrown away after the query has completed. It is a very short-lived cache that is designed to avoid repeated lookups during multiple planning passes. Every query is still a "cold" query. It would be possible to create another longer-lived caching layer on top of this but I am not trying to solve that problem at the moment.
There is no concern for cache eviction / staleness here because this cache should not be kept longer than a single query. There is some possibility for a catalog change to happen in between reference lookups.
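The single-query cache described above can be sketched as follows. This is a simplified, synchronous stand-in: the actual helpers resolve references asynchronously and cache `Arc<dyn TableProvider>` values, while this sketch uses strings and a plain closure. The struct and method names here are illustrative, not DataFusion's actual API.

```rust
use std::collections::HashMap;

// Illustrative sketch of a per-query cache: every reference is looked up at
// most once (the potentially expensive remote call), then all later planning
// passes are served synchronously from the map. The whole struct is dropped
// when the query finishes, so staleness is bounded by a single query.
pub struct ResolvedSchema {
    // In the real helpers this would map to Arc<dyn TableProvider>.
    cached_tables: HashMap<String, String>,
}

impl ResolvedSchema {
    /// Populate the cache up front. In the real design this step is async;
    /// `lookup` stands in for the remote catalog call.
    pub fn resolve<F>(references: &[&str], mut lookup: F) -> Self
    where
        F: FnMut(&str) -> Option<String>,
    {
        let mut cached_tables = HashMap::new();
        for name in references {
            // Skip references we have already resolved in this query.
            if !cached_tables.contains_key(*name) {
                if let Some(table) = lookup(name) {
                    cached_tables.insert((*name).to_string(), table);
                }
            }
        }
        Self { cached_tables }
    }

    /// Synchronous lookups during planning only ever hit the cache.
    pub fn table(&self, name: &str) -> Option<&String> {
        self.cached_tables.get(name)
    }
}
```

Because `resolve` is the only place the remote catalog is contacted, repeated planner passes over the same reference cost nothing extra, which is the point of the cache-then-plan approach.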
@westonpace So we agree on the need for this.
datafusion/catalog/src/async.rs (outdated)
Ok(self.cached_tables.get(name).cloned())
}

#[allow(unused_variables)]
Please avoid `#[allow]` attributes (and if one is really needed, add a code comment explaining why).
I'm not sure I understand what you mean by "engine". I would personally expect a planner to cache lookups in the same way I expect a compiler to optimize away repeated calls to a constant method. Though I understand this is not how the synchronous planner works today, this is an optimization that benefits all engines and should work equally for all.
Agreed
IIUC, this PR adds new code only, so it's not taking away any capability, and it's also not adding any new behavior.
When I said "engine" I meant datafusion core. I would want the core to do what you described as "expect a planner to cache lookups".
}
}

/// A trait for schema providers that must resolve tables asynchronously
this is very cool. Thank you @westonpace
Using this trait I think we can simplify the example in #13722 significantly. I will give that a try and see what I can come up with
Update, I see you already had planned to do so -- I missed #13722 (review)
My understanding is that this PR adds a kind of "basic implementation of a remote catalog" that will almost certainly not be used by all systems (due to the varying needs of caching, as @findepi mentions, among other things). So perhaps we can update the documentation on the traits to make it clear that they provide a basic implementation for a remote catalog that must be accessed asynchronously, and that for more complex use cases, such as more sophisticated caching, users can build their own implementation using the same CatalogProvider APIs?
Thank you @westonpace -- I think this PR looks great. 🏆 🏅
Also thank you @comphead and @findepi for the reviews.
I spent a few minutes trying to adapt #13722 to use this API and while I did not finish, it was going quite well.
All I think this PR needs is:
- Update the remote_catalog.rs example to use these new helpers
- Add a link in the docs of AsyncCatalog, etc. to the remote_catalog.rs example
///
/// If a table reference's catalog name does not match this name then the reference will be ignored
/// when calculating the cached set of tables (this allows other providers to supply the table)
fn catalog_name(&self) -> &str;
When trying this API out I didn't fully understand it (or what I should return) -- maybe if it is optional / has an advanced use case we could provide a default implementation.

Suggested change:
- fn catalog_name(&self) -> &str;
+ fn catalog_name(&self) -> Option<&str> { return None }
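The suggestion above could look like the following minimal sketch. The trait name here is hypothetical (the PR's actual trait is not defined this way); it only illustrates how a default implementation would let simple providers skip the method while providers that need reference filtering override it.

```rust
// Hypothetical trait, illustrating the reviewer's suggested default.
trait AsyncCatalogProviderSketch {
    // Default: this provider does not filter references by catalog name.
    fn catalog_name(&self) -> Option<&str> {
        None
    }
}

// A simple provider gets the default for free.
struct BareProvider;
impl AsyncCatalogProviderSketch for BareProvider {}

// A provider that participates in reference filtering supplies its name.
struct NamedProvider;
impl AsyncCatalogProviderSketch for NamedProvider {
    fn catalog_name(&self) -> Option<&str> {
        Some("weston")
    }
}
```

Returning `Option<&str>` makes the "advanced" case opt-in, at the cost of callers having to decide what `None` means during reference resolution.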
Yeah, I was a bit torn on this anyways. The problem arises when the user wants to use an AsyncSchemaProvider or AsyncCatalogProvider as the top level catalog. In these cases it isn't clear what we should do with full / partial table references.
For example, if the user adds an AsyncSchemaProvider and then tries to resolve the query "SELECT * FROM weston.public.my_table" what should we do?
- We could just assume that all table references are intended for us. This works as long as this schema provider is the only provider registered. If there are multiple providers registered then we need to know which to use for a given table reference somehow.
- We could assume we don't match the table reference and we will only match bare references.
- Or we can require schema providers to supply their own name and the catalog name so that we can filter references that apply (this is what I do)
The main problem with the current approach is that users whose top level is an AsyncCatalogProviderList have to implement these methods even though they are meaningless (we will do the filtering in the higher level resolve function).
We should probably do whatever the synchronous planner does in this case, but I just didn't know. If I register a schema provider with a SessionContext and then call sql with a full (not bare) table reference, does it apply the provider I registered or not?
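The third option above (the provider supplies its own catalog and schema names, and non-matching references are ignored so another registered provider can supply them) can be illustrated with a small sketch. The dotted-string parsing is a stand-in for DataFusion's `TableReference`, and the function itself is purely illustrative.

```rust
// Illustrative reference filter: decide whether a table reference applies to
// a provider identified by (catalog_name, schema_name).
fn applies_to(catalog_name: &str, schema_name: &str, reference: &str) -> bool {
    let parts: Vec<&str> = reference.split('.').collect();
    match parts.as_slice() {
        // Bare reference ("my_table"): assume it is intended for this provider.
        [_table] => true,
        // Partial reference ("public.my_table"): the schema must match.
        [schema, _table] => *schema == schema_name,
        // Full reference ("weston.public.my_table"): catalog and schema must match.
        [catalog, schema, _table] => *catalog == catalog_name && *schema == schema_name,
        _ => false,
    }
}
```

With this rule, `"SELECT * FROM weston.public.my_table"` would be resolved only by the provider named `("weston", "public")`, while other registered providers simply ignore the reference.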
}
}

/// A trait for schema providers that must resolve tables asynchronously
@findepi Ok, I understand what you mean by engine now. I agree that we could maybe move this kind of caching into the planner itself. If we did so I think we could keep the traits and just let the
I'm currently using a copy of these traits / structs in some LanceDB stuff. Our enterprise / cloud product has a simple catalog. We are not generally stressed too much about queries-per-second so my main goal has been adapting our (asynchronous) catalog into datafusion. There's some movement to add a (more polished) catalog to our OSS stuff as well. When it comes to SQL queries LanceDB is pretty much a frontend for datafusion and so whatever we use as a catalog we will need to integrate.
Which issue does this PR close?
Closes #10339 .
Rationale for this change
As discussed in #13582 we do not actually want to make the schema providers asynchronous (the downstream changes are significant). Instead a cache-then-plan approach was outlined in #13714. This PR adds helpers which make it easier for users to follow the cache-then-plan approach.
This is hopefully just a first step. Eventually I would like to integrate these into SessionContext itself so that we can have methods like register_async_catalog_list, and SessionContext will keep track of a list of asynchronous providers and take care of calling the resolve method for the user. The entire process can then be entirely hidden from the user.

What changes are included in this PR?
Adds helpers, which are exposed in datafusion_catalog but not yet integrated into SessionContext. Users can use them following the example outlined in #13714.

Are these changes tested?
Yes.
Are there any user-facing changes?
New APIs only. No breaking changes or modifications to existing APIs.