
feat: add AsyncCatalogProvider helpers for asynchronous catalogs #13800

Open — wants to merge 4 commits into main

Conversation

westonpace (Member)

Which issue does this PR close?

Closes #10339.

Rationale for this change

As discussed in #13582, we do not actually want to make the schema providers asynchronous (the downstream changes are significant). Instead, a cache-then-plan approach was outlined in #13714. This PR adds helpers which make it easier for users to follow the cache-then-plan approach.

This is hopefully just a first step. Eventually I would like to integrate these into SessionContext itself so that we can have methods like register_async_catalog_list; SessionContext would then keep track of a list of asynchronous providers and take care of calling the resolve method for the user, so the entire process can be hidden from the user.

What changes are included in this PR?

Adds helpers that are exposed in datafusion_catalog but not yet integrated into SessionContext. Users can use them by following the example outlined in #13714, as sketched below.
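
For illustration, here is a minimal sketch of the provider side under this approach. The trait and method names follow the excerpts quoted later in this review, and the trait is assumed to be declared with async_trait (as is common in DataFusion); the client type, its fetch_table method, and the exact import paths are assumptions for the sketch, not the final API:

```rust
use std::sync::Arc;

use async_trait::async_trait;
use datafusion_catalog::{AsyncSchemaProvider, TableProvider};
use datafusion_common::Result;

/// Hypothetical client for a remote catalog service (stands in for real I/O).
struct MyCatalogClient;

impl MyCatalogClient {
    async fn fetch_table(&self, _name: &str) -> Result<Option<Arc<dyn TableProvider>>> {
        Ok(None) // a real client would issue a network request here
    }
}

/// An asynchronous schema provider backed by the remote service.
struct MyRemoteSchema {
    client: MyCatalogClient,
}

#[async_trait]
impl AsyncSchemaProvider for MyRemoteSchema {
    /// The only lookup the user implements is asynchronous; the helper's
    /// resolve method turns it into a synchronous, cached SchemaProvider
    /// that serves a single query's planning passes.
    async fn table(&self, name: &str) -> Result<Option<Arc<dyn TableProvider>>> {
        self.client.fetch_table(name).await
    }
}
```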

Are these changes tested?

Yes.

Are there any user-facing changes?

New APIs only. No breaking changes or modifications to existing APIs.

@findepi (Member)

findepi commented Dec 17, 2024

Instead, a cache-then-plan approach was outlined in #13714.

What's the cache-then-plan approach? (The linked page doesn't include "cache".)
How did we solve the cold cache problem?


/// A schema provider that looks up tables in a cache
///
/// This is created by the [`AsyncSchemaProvider::resolve`] method

Contributor:

Does that mean the code is auto-generated?

westonpace (Member, Author):

No. I have changed the comment to "Instances are created by...". Is this clearer?
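
For reference, the revised doc comment presumably now reads along these lines:

```rust
/// A schema provider that looks up tables in a cache
///
/// Instances are created by the [`AsyncSchemaProvider::resolve`] method
```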

Err(DataFusionError::Execution(format!("Attempt to deregister table '{name}' with ResolvedSchemaProvider which is not supported")))
}

fn table_exist(&self, name: &str) -> bool {

Contributor:

Suggested change:
- fn table_exist(&self, name: &str) -> bool {
+ fn table_exists(&self, name: &str) -> bool {

westonpace (Member, Author):

This method name is defined by the SchemaProvider trait. Renaming it would be a breaking change and I don't think it is justified.
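
For context, the method in question on the existing SchemaProvider trait looks roughly like this (other methods elided):

```rust
pub trait SchemaProvider: Sync + Send {
    // ... other methods elided ...

    /// Returns true if this schema contains a table with the given name.
    /// Note the long-established spelling: table_exist, not table_exists.
    fn table_exist(&self, name: &str) -> bool;
}
```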

let Some(schema) = schema else { continue };

if !schema.cached_tables.contains_key(reference.table()) {
    let resolved_table =

Contributor:

Could this part be factored out into a separate helper method?


}

#[tokio::test]
async fn test_defaults() {

Contributor:

I'm wondering: if we use cached tables, should we have tests for that? I mean the cached tables should reflect the most recent catalog state; if a table is added/modified/dropped, that should be reflected in the cache.

westonpace (Member, Author):

Discussed below

@westonpace (Member, Author)

What's the cache-then-plan approach? (The linked page doesn't include "cache".)
How did we solve the cold cache problem?

@findepi

Perhaps I should avoid using the word cache. This is not a long lived multi-query cache. This is a single query cache meant to be thrown away after the query has completed. It is a very short-lived cache that is designed to avoid repeated lookups during multiple planning passes. Every query is still a "cold" query. It would be possible to create another longer-lived caching layer on top of this but I am not trying to solve that problem at the moment.
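
For concreteness, the per-query lifecycle described above might look roughly like the sketch below. resolve_for_query is a hypothetical stand-in for the PR's resolve helper (its exact argument list is not shown in this thread), and MyRemoteSchema stands for a user's AsyncSchemaProvider implementation:

```rust
use std::sync::Arc;

use datafusion::catalog::SchemaProvider;
use datafusion::error::Result;
use datafusion::prelude::*;

/// Stub for the user's async provider (see the sketch in the PR description).
struct MyRemoteSchema;

/// Hypothetical stand-in for the PR's AsyncSchemaProvider::resolve helper.
async fn resolve_for_query(
    _remote: &MyRemoteSchema,
    _sql: &str,
) -> Result<Arc<dyn SchemaProvider>> {
    unimplemented!("would pre-fetch every table the query references")
}

async fn plan_one_query(
    ctx: &SessionContext,
    sql: &str,
    remote: &MyRemoteSchema,
) -> Result<DataFrame> {
    // 1. Resolve the query's table references asynchronously, once. The
    //    result is an ordinary synchronous SchemaProvider backed by an
    //    in-memory map populated up front.
    let resolved = resolve_for_query(remote, sql).await?;

    // 2. Register it and plan synchronously; the planner's repeated
    //    lookups across multiple passes all hit the per-query cache.
    let catalog = ctx.catalog("datafusion").expect("default catalog");
    let _prev = catalog.register_schema("public", resolved)?;
    let df = ctx.sql(sql).await?;

    // 3. Nothing holds the cache beyond this query: every new query
    //    resolves again, so every query is still "cold" by design.
    Ok(df)
}
```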

I'm wondering: if we use cached tables, should we have tests for that? I mean the cached tables should reflect the most recent catalog state; if a table is added/modified/dropped, that should be reflected in the cache.

@comphead

There is no concern for cache eviction / staleness here because this cache should not be kept longer than a single query. There is some possibility for a catalog change to happen in between reference lookups (resolve) and query execution. However, this will always be possible when using a remote catalog. The query execution should return an error from the remote endpoint saying "no database/schema/table found" or "query does not match schema". I'm not sure we can avoid this without some kind of synchronization mechanism with a remote catalog and I don't think there has been much work in that regard (but I admittedly haven't examined the APIs in great depth).

@findepi (Member)

findepi commented Dec 18, 2024

Perhaps I should avoid using the word cache. This is not a long lived multi-query cache. This is a single query cache meant to be thrown away after the query has completed

@westonpace
Thanks for explaining. I think the use of "cache" is justified in this context and easier to understand than, e.g., "working set". I agree it is important to have a notion of query-level information, for two reasons. Performance is the obvious one: we should not repeatedly compute info we already knew. The second is correctness (consistency): if a query, e.g., self-joins an Iceberg table T, the table may need to be read twice, but the reads should come from the same snapshot of T.

So we agree on the need for this. The question is who's responsible for providing this consistency: is it the catalog or table provider (e.g., it should self-wrap in ResolvedCatalogProvider), or is it the engine itself (in which case, how exactly is this implemented)?

Ok(self.cached_tables.get(name).cloned())
}

#[allow(unused_variables)]

Member:

Please avoid #[allow(...)] attributes (and if one is really needed, add a code comment explaining why).
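
(A side note for readers: the usual allow-free fix is to underscore-prefix intentionally-unused parameters. The fragment below is a guess at where the lint fires — an unsupported mutation method such as SchemaProvider::register_table — with exec_err! from datafusion_common.)

```rust
// Fragment of a SchemaProvider impl; assumes:
//   use datafusion_common::{exec_err, Result};
//   use datafusion_catalog::TableProvider;
// Rather than silencing the lint for the whole method with
// #[allow(unused_variables)], underscore-prefix the unused parameters:
fn register_table(
    &self,
    _name: String,                  // leading underscore: intentionally unused
    _table: Arc<dyn TableProvider>, // ditto
) -> Result<Option<Arc<dyn TableProvider>>> {
    exec_err!("register_table is not supported on a resolved provider")
}
```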

@westonpace (Member, Author)

The question is who's responsible for providing this consistency: is it the catalog or table provider (e.g., it should self-wrap in ResolvedCatalogProvider), or is it the engine itself (in which case, how exactly is this implemented)?

I'm not sure I understand what you mean by "it should self-wrap in ResolvedCatalogProvider".

I would personally expect a planner to cache lookups in the same way I expect a compiler to optimize away repeated calls to a constant method. Though I understand this is not how the synchronous planner works today.

This is an optimization that benefits all engines and should work equally for all so it seems useful for the resolve method to provide it. Is there some advantage of having every engine reimplement this pattern? Is there some functionality, customization or capability we are taking away from engines by doing this here?

@findepi (Member)

findepi commented Dec 19, 2024

I would personally expect a planner to cache lookups in the same way I expect a compiler to optimize away repeated calls to a constant method.

Agreed

Is there some functionality, customization or capability we are taking away from engines by doing this here?

IIUC, this PR adds new code only, so it's not taking away any capability, and it's also not adding any new behavior; it adds building blocks for a desired behavior to be implemented later.

Is there some advantage of having every engine reimplement this pattern?

When I said "engine" I meant DataFusion core. I would want the core to do what you described as "expect a planner to cache lookups".
You mean "engines" in plural. How is this new code going to be used? In DataFusion and/or elsewhere?

}
}

/// A trait for schema providers that must resolve tables asynchronously

Contributor:

This is very cool. Thank you @westonpace!

Using this trait I think we can simplify the example in #13722 significantly. I will give that a try and see what I can come up with.

Contributor:

Update, I see you already had planned to do so -- I missed #13722 (review)

@alamb changed the title from "feat: add helpers for users with asynchornous catalogs" to "feat: add AsyncCatalogProvider wrappers to permit asynchronous catalogs" on Dec 19, 2024
@alamb (Contributor)

alamb commented Dec 19, 2024

What's the cache-then-plan approach? (The linked page doesn't include "cache".)
How did we solve the cold cache problem?

@findepi

Perhaps I should avoid using the word cache. This is not a long lived multi-query cache. This is a single query cache meant to be thrown away after the query has completed.

My understanding is that this PR adds a kind of "basic implementation of a remote catalog" that will almost certainly not be used for all systems (due to the varying needs of caching that @findepi mentions, among others).

So perhaps we can update the documentation on the traits to make it clear that they provide a basic implementation for a remote catalog that must be accessed asynchronously, and that for more complex use cases, such as more sophisticated caching, users can build their own implementation using the same CatalogProvider APIs?

@alamb (Contributor) left a review:

Thank you @westonpace -- I think this PR looks great. 🏆 🏅

Also thank you @comphead and @findepi for the reviews

I spent a few minutes trying to adapt #13722 to use this API and, while I did not finish, it was going quite well.

All I think this PR needs is:

  1. Update the remote_catalog.rs example to use these new helpers
  2. Add a link in the docs of AsyncCatalog, etc. to the remote_catalog.rs example

///
/// If a table reference's catalog name does not match this name then the reference will be ignored
/// when calculating the cached set of tables (this allows other providers to supply the table)
fn catalog_name(&self) -> &str;

Contributor:

When trying this API out I didn't fully understand it (or what I should return) -- maybe if it is optional / serves an advanced use case we could provide a default implementation.

Suggested change:
- fn catalog_name(&self) -> &str;
+ fn catalog_name(&self) -> Option<&str> { return None }

westonpace (Member, Author):

Yeah, I was a bit torn on this anyway. The problem arises when the user wants to use an AsyncSchemaProvider or AsyncCatalogProvider as the top-level catalog. In these cases it isn't clear what we should do with full / partial table references.

For example, if the user adds an AsyncSchemaProvider and then tries to resolve the query "SELECT * FROM weston.public.my_table", what should we do?

  • We could just assume that all table references are intended for us. This works as long as this schema provider is the only provider registered. If there are multiple providers registered, then we somehow need to know which one to use for a given table reference.
  • We could assume we don't match the full table reference and only match bare references.
  • Or we can require schema providers to supply their own name and the catalog name so that we can filter the references that apply (this is what I do; see the sketch below).

The main problem with the current approach is that users whose top level is an AsyncCatalogProviderList have to implement these methods even though they are meaningless there (we will do the filtering in the higher-level resolve function).

We should probably do whatever the synchronous planner does in this case, but I just don't know: if I register a schema provider with a SessionContext and then call sql with a full (not bare) table reference, does it apply the provider I registered or not?
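
For concreteness, the filtering behind the third option above might look roughly like this. This is a sketch of assumed helper internals, not the PR's actual code; TableReference is DataFusion's existing reference enum, and treating bare references as always applying is an assumption:

```rust
use datafusion_common::TableReference;

/// Decide whether a table reference should be handled by a provider that
/// identifies itself as `catalog_name`.`schema_name` (option three above).
fn reference_applies(
    reference: &TableReference,
    catalog_name: &str,
    schema_name: &str,
) -> bool {
    match reference {
        // Bare references ("my_table") carry no catalog/schema, so here we
        // assume they target the default provider and always apply.
        TableReference::Bare { .. } => true,
        // Partial references ("public.my_table") must match the schema name.
        TableReference::Partial { schema, .. } => schema.as_ref() == schema_name,
        // Full references ("weston.public.my_table") must match both names.
        TableReference::Full { catalog, schema, .. } => {
            catalog.as_ref() == catalog_name && schema.as_ref() == schema_name
        }
    }
}
```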


@alamb changed the title from "feat: add AsyncCatalogProvider wrappers to permit asynchronous catalogs" to "feat: add AsyncCatalogProvider helpers for asynchronous catalogs" on Dec 19, 2024
@westonpace (Member, Author)

When I said "engine" I meant DataFusion core. I would want the core to do what you described as "expect a planner to cache lookups".

@findepi OK, I understand what you mean by "engine" now. I agree that we could maybe move this kind of caching into the planner itself. If we did so, I think we could keep the traits and simply deprecate the resolve method, eventually letting it go away.

You mean "engines" in plural. How is this new code going to be used? In DataFusion and/or elsewhere?

I'm currently using a copy of these traits / structs in some LanceDB work. Our enterprise / cloud product has a simple catalog. We are not generally stressed too much about queries per second, so my main goal has been adapting our (asynchronous) catalog to DataFusion.

There's some movement to add a (more polished) catalog to our OSS offering as well. When it comes to SQL queries, LanceDB is pretty much a frontend for DataFusion, so whatever we use as a catalog will need to integrate.

Labels: catalog (Related to the catalog crate)

Successfully merging this pull request may close these issues:

  • Make all SchemaProvider trait APIs async

4 participants