Multi-cluster support #378

Open
canbekley opened this issue Nov 4, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@canbekley
Contributor

canbekley commented Nov 4, 2024

ClickHouse can run ON CLUSTER statements against any cluster defined in the remote_servers section of the server configuration, and a Distributed table can point to tables in any such cluster. For a multi-cluster architecture, I would like dbt-clickhouse to be able to create distributed tables on a remote cluster that point to the actual tables in the local cluster.

An example distributed table could look as follows:

CREATE TABLE `db`.`tablename` ON CLUSTER `remote_cluster` (
    ...
) ENGINE = Distributed(`local_cluster`, db, tablename)

I would suggest making this an additional layer that is independent of the materialized and incremental_strategy settings and that works for both distributed and non-distributed materializations. For the distributed_table materialization, that would mean two distributed tables (one on the local cluster and one on the remote cluster) plus one "local" table on the local cluster.
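To make the table layout concrete, here is a minimal Python sketch of which tables each materialization would end up with. Everything in it is illustrative and not part of dbt-clickhouse: the `planned_tables` helper, the `PlannedTable` type, the `_local` suffix, and the `MergeTree` placeholder engine are all assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedTable:
    cluster: str  # cluster the CREATE TABLE would run ON
    name: str     # table name inside the database
    engine: str   # abbreviated engine clause, for illustration only


def planned_tables(materialization, database, table, local_cluster,
                   remote_clusters, local_suffix="_local"):
    """Return the tables one model would create (hypothetical helper)."""
    plan = []
    if materialization == "distributed_table":
        # One "local" table plus one distributed table on the local cluster.
        local_name = f"{table}{local_suffix}"
        plan.append(PlannedTable(local_cluster, local_name, "MergeTree"))
        plan.append(PlannedTable(
            local_cluster, table,
            f"Distributed({local_cluster}, {database}, {local_name})"))
        target = local_name
    else:
        # Non-distributed materialization: a single table on the local cluster.
        plan.append(PlannedTable(local_cluster, table, "MergeTree"))
        target = table
    # The proposed extra layer: one distributed table per remote cluster,
    # pointing back at the actual table in the local cluster.
    for remote in remote_clusters:
        plan.append(PlannedTable(
            remote, table,
            f"Distributed({local_cluster}, {database}, {target})"))
    return plan
```

With one remote cluster, a distributed_table model yields three tables in this sketch, and a plain table model yields two.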

For usage, I propose adding remote_clusters as an optional list parameter to profiles, and an add_to_remote_clusters boolean flag as a model configuration.

Functional requirements could be the following:

  • when no remote_clusters are configured, current functionality is unchanged
  • materialization will fail when remote_clusters are configured but the current ClickHouse host doesn’t know all of those clusters
  • materialization will fail when add_to_remote_clusters is set to true, but no remote_clusters are configured
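
The three requirements above could be checked up front, before any DDL is issued. The following is a rough sketch only: `validate_remote_clusters` and `RemoteClusterConfigError` are hypothetical names, and `known_clusters` is assumed to be collected by the caller, for example from the `cluster` column of ClickHouse's `system.clusters` table.

```python
class RemoteClusterConfigError(ValueError):
    """Raised when the remote cluster settings are inconsistent."""


def validate_remote_clusters(remote_clusters, add_to_remote_clusters,
                             known_clusters):
    """Return the remote clusters a model should target (hypothetical helper).

    remote_clusters: optional list from the profile (None if not configured)
    add_to_remote_clusters: boolean flag from the model configuration
    known_clusters: cluster names the current ClickHouse host knows about
    """
    remote_clusters = list(remote_clusters or [])

    # Flag set on the model, but the profile lists no remote clusters.
    if add_to_remote_clusters and not remote_clusters:
        raise RemoteClusterConfigError(
            "add_to_remote_clusters is true but no remote_clusters are configured")

    # Every configured remote cluster must be known to the current host.
    missing = sorted(set(remote_clusters) - set(known_clusters))
    if missing:
        raise RemoteClusterConfigError(
            f"remote_clusters not known to this host: {', '.join(missing)}")

    # Nothing configured, or the model did not opt in: behave as today.
    return remote_clusters if add_to_remote_clusters else []
```

An empty result means the materialization proceeds exactly as it does now.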

When remote_clusters are configured and add_to_remote_clusters is set to true:

  • materializations will create additional distributed tables on the remote clusters, pointing to the local tables.
  • databases are created correctly on remote clusters
  • schemas are updated consistently on local and remote clusters
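
As a sketch of what the remote side could involve, the snippet below renders the statements for each remote cluster: a database creation followed by a distributed table pointing at the local cluster. `remote_ddl` and `quote` are hypothetical helpers, not dbt-clickhouse APIs, and the column list is passed in as a ready-made string so that the same definition can be reused for the local and remote tables.

```python
import re

# Only plain identifiers are accepted here, to keep quoting trivial.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quote(name):
    """Backtick-quote a simple identifier, rejecting anything unusual."""
    if not _IDENT.match(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return f"`{name}`"


def remote_ddl(database, table, columns_sql, local_cluster, remote_clusters,
               target_table=None):
    """Render DDL for every remote cluster (hypothetical helper).

    target_table is the table in the local cluster that the remote
    distributed table points to; it defaults to the same table name.
    """
    target = target_table or table
    statements = []
    for remote in remote_clusters:
        # Make sure the database exists on the remote cluster first.
        statements.append(
            f"CREATE DATABASE IF NOT EXISTS {quote(database)} "
            f"ON CLUSTER {quote(remote)}")
        # Distributed table on the remote cluster, reading from the local one.
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {quote(database)}.{quote(table)} "
            f"ON CLUSTER {quote(remote)} ({columns_sql}) "
            f"ENGINE = Distributed({quote(local_cluster)}, "
            f"{quote(database)}, {quote(target)})")
    return statements
```

Keeping schemas consistent after a column change would need additional statements on each remote cluster; that part is not covered by this sketch.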

Feel free to assign me directly to it.

canbekley added the enhancement label Nov 4, 2024
@pheepa

pheepa commented Nov 29, 2024

Hi!
I like the idea and I'd love to work on it.
Do you think configuring this inside the model is enough, or should I inherit it from profiles.yml? Do the tests need to be modified, or will testing against the two clusters test_shard and test_replica suffice?
