
Cluster setting and Distributed Table tests #186

Merged: 14 commits into ClickHouse:main on Oct 26, 2023
Conversation


@gfunc gfunc commented Aug 23, 2023

Summary

  • Added attribute should_on_cluster to the ClickHouseRelation class to indicate whether the on_cluster_clause macro should return content.
  • Added a cluster setting for tests.
  • Debugged macros.
  • Added tests for the distributed_table materialization and replicated table engines.
  • Changed the GitHub workflow tests to use docker-compose, which starts a 3-node ClickHouse cluster.

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added

Caveats

  • Tests for distributed_incremental are not added yet.


CLAassistant commented Aug 23, 2023

CLA assistant check
All committers have signed the CLA.

@zli06160 zli06160 left a comment

👍 The modifications in dbt/include/clickhouse/macros/materializations/distributed_table.sql do make sense.
We should indeed

  • define the default sharding_key, cluster, etc.;
  • or show error messages (instructions) if they are not defined in template.sql and/or dbt_project.yml.
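A minimal sketch of the kind of validation being suggested here — default the sharding key, but raise a clear error when no cluster is configured. The names `validate_distributed_config` and `DEFAULT_SHARDING_KEY` are illustrative only; they are not part of the actual adapter.

```python
# Hypothetical validation helper: fall back to a default sharding key,
# but fail loudly when no cluster is configured.
DEFAULT_SHARDING_KEY = "rand()"


def validate_distributed_config(cluster: str, sharding_key: str = "") -> dict:
    # A whitespace-only cluster name counts as unset.
    if not cluster or not cluster.strip():
        raise ValueError(
            "distributed_table materialization requires a 'cluster' setting "
            "in the profile or dbt_project.yml"
        )
    # Missing sharding_key falls back to a default instead of failing.
    return {"cluster": cluster, "sharding_key": sharding_key or DEFAULT_SHARDING_KEY}
```

Raising early with an explicit message is the second option from the list above; the default fallback is the first.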


gfunc commented Sep 6, 2023

Hi @genzgd, any update on the review of this PR, or any thoughts/suggestions?

@gfunc gfunc requested a review from zli06160 September 14, 2023 03:43
group by name, schema, type, db_engine
{%- else -%}
0 as is_on_cluster
from system.tables as t JOIN system.databases as db on t.database = db.name
Contributor

Maybe write join in lower case. The same comment applies to other SQL keywords in several files.

from dbt.tests.util import run_dbt

from tests.integration.adapter.incremental.test_incremental import uniq_schema

Contributor

Maybe add more comments to describe the tests more clearly: 1) at the beginning of the .py file, or 2) at the beginning of each test.

Contributor Author

sure, will do.

Contributor Author

I am afraid most tests are self-explanatory by name, and are mostly based on dbt's sample tests. I will try to add more comments for the distributed materialization tests.

    engine: ReplicatedMergeTree('/clickhouse/tables/{uuid}/one_shard', '{server_index}')
- name: added
  config:
    engine: ReplicatedMergeTree('/clickhouse/tables/{uuid}/one_shard', '{server_index}')
Contributor

Maybe it would be better to test at least once with multiple shards x multiple replicas?
e.g. ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/table_name', '{replica}') (example from https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication)

Contributor Author

For the integration tests, even if we managed to set up a multiple shards x multiple replicas matrix, the data would always exist in one shard (same zk path); maybe it is not worth adding another test?

@zli06160 zli06160 Oct 9, 2023

For me it is a common usage.
But it is managed more by dbt-core and the ClickHouse cluster, so any integration test (in the current project) on the cluster mode is enough.

Contributor

BTW, I did not get the reason why the data persists in one shard.
Do you use drop table distributed_table_name on cluster cluster_name, or create or replace table, or something similar?

@gfunc gfunc Oct 10, 2023

I meant the Replicated engine tables with non-distributed materializations, like this seed. If I understand correctly, during our tests the insert into query will be performed by one node and the data will be replicated among the nodes with the same shard (same zk path).

Contributor

Yep, if you create a (non-distributed) table xxx on cluster xxx and then insert into the non-distributed table, the data will be on only one shard (and its replicas); and the next insert into can target another shard 😿.

For the seeding step, I see 3 possible ways:

  • insert into a distributed table, which is based on a table with any replicated engine;
  • or use a view as interface;
  • or use File Table Engine <= the best in my opinion.
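The first option above can be sketched as the DDL pair a seed step might issue: a ReplicatedMergeTree local table plus a Distributed table over it, so inserts are spread across shards and replicated within each shard. All table, database, and cluster names below are examples, not the adapter's actual generated SQL.

```python
# Illustrative sketch: build the two CREATE statements for seeding through a
# Distributed table backed by a ReplicatedMergeTree local table.
def seed_ddl(db: str, table: str, cluster: str) -> list:
    local = f"{table}_local"
    return [
        # Local replicated table; {shard}/{replica} are ClickHouse macros
        # expanded per node, so each shard gets its own zk path.
        f"CREATE TABLE {db}.{local} ON CLUSTER {cluster} (id UInt64, val String) "
        f"ENGINE = ReplicatedMergeTree('/clickhouse/tables/{{shard}}/{db}/{local}', '{{replica}}') "
        f"ORDER BY id",
        # Distributed facade routing inserts across shards by rand().
        f"CREATE TABLE {db}.{table} ON CLUSTER {cluster} AS {db}.{local} "
        f"ENGINE = Distributed({cluster}, {db}, {local}, rand())",
    ]
```

Inserting into the Distributed table then lands rows on all shards, unlike inserting into the replicated table directly.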

Contributor Author

My approach is to make sure the seeds use the same shard and different replicas, i.e. to broadcast the whole dataset.
I have never used the File Table Engine before, could you explain a bit more? How would it help in this situation?

@zli06160 zli06160 Oct 12, 2023

The File Table Engine (or just the file function) reads the csv files directly; not sure if it will help.
https://clickhouse.com/docs/en/engines/table-engines/special/file
https://clickhouse.com/docs/en/sql-reference/table-functions/file

Contributor

I think I now understand how you test distributed_table. I still recommend a separate .py file for each main materialization or, if that is not applicable, more documentation.


zli06160 commented Sep 19, 2023

  • The distributed_table works well on the cluster with my test case (several dbt runs, no errors, correct count(*));
  • The distributed_incremental works well on the cluster with my test case (several dbt runs, no errors, correct count(*)).


gfunc commented Sep 20, 2023

Hi @zli06160 any thoughts on the logic for should_on_cluster attribute in relation.py?

    @property
    def should_on_cluster(self) -> bool:
        if self.include_policy.identifier:
            return self.can_on_cluster
        else:
            # create database/schema on cluster by default
            return True

    @classmethod
    def get_on_cluster(
        cls: Type[Self], cluster: str = '', materialized: str = '', engine: str = ''
    ) -> bool:
        if cluster:
            return 'view' == materialized or 'distributed' in materialized or 'Replicated' in engine
        else:
            return False
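To make the behavior of the quoted get_on_cluster concrete, here is a standalone reproduction of its logic with its truth table spelled out. This mirrors the snippet above for illustration; it is not imported from the adapter itself.

```python
# Standalone copy of the get_on_cluster logic quoted above.
def get_on_cluster(cluster: str = '', materialized: str = '', engine: str = '') -> bool:
    if cluster:
        return 'view' == materialized or 'distributed' in materialized or 'Replicated' in engine
    else:
        return False


# With a cluster set: views, distributed materializations, and Replicated
# engines go on-cluster; plain MergeTree tables do not. Without a cluster,
# the answer is always False.
cases = [
    (('test_shard', 'view', ''), True),
    (('test_shard', 'distributed_table', ''), True),
    (('test_shard', 'table', 'ReplicatedMergeTree'), True),
    (('test_shard', 'table', 'MergeTree'), False),
    (('', 'view', ''), False),
]
```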

I think we should start the discussion early, since this changes the behavior of the adapter once a cluster is set. Also, I tried to make the logic compatible with the previous version by adding the attribute can_on_cluster for existing models.

Maybe it is worth mentioning in the README?


zli06160 commented Sep 20, 2023

For me, the query must contain the on cluster clause: 1) if the connection is to a cluster; or 2) if any keyword like distributed/Replicated/etc. is detected in the .sql file.

The logic of get_on_cluster seems correct; it just needs to include MATERIALIZED VIEW (of course that materialization is not implemented yet) and to prevent the case cluster = ' '.
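A hedged sketch of the two tweaks suggested here — treating a whitespace-only cluster as unset, and also matching materialized views. This is a proposal for discussion, not the merged implementation.

```python
# Proposed revision of get_on_cluster incorporating the review comments above.
def get_on_cluster(cluster: str = '', materialized: str = '', engine: str = '') -> bool:
    # Prevents the cluster = ' ' case: whitespace-only means no cluster.
    if not cluster.strip():
        return False
    return (
        # 'materialized_view' added alongside 'view' per the suggestion.
        materialized in ('view', 'materialized_view')
        or 'distributed' in materialized
        or 'Replicated' in engine
    )
```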

Other contributors will need more detailed comments and a README to understand the project-specific notions.


gfunc commented Oct 8, 2023

I will be available starting tomorrow (Oct 09) for further modifications.

@gfunc gfunc requested a review from zli06160 October 13, 2023 04:56
@@ -9,7 +11,13 @@
from dbt.tests.adapter.basic.test_singular_tests import BaseSingularTests
from dbt.tests.adapter.basic.test_snapshot_check_cols import BaseSnapshotCheckCols
@zli06160 zli06160 Oct 14, 2023

I think it is better to set DBT_CH_TEST_CLUSTER (I guess the value should be 'test_shard'?) somewhere for integration testing.

After I cloned the project, checked out the branch, and built and started the Docker containers, the variable was not set, so many tests were skipped.

Then, in this file, in order to fully run the tests with os.environ["DBT_CH_TEST_CLUSTER"] = 'test_shard' added, there is an error related to BaseSnapshotCheckCols:

09:11:29  Database Error in snapshot cc_all_snapshot (snapshots/cc_all_snapshot.sql)
09:11:29    :HTTPDriver for http://localhost:8123 returned response code 400)
09:11:29     Code: 15. DB::Exception: Column dbt_change_type specified more than once. (DUPLICATE_COLUMN) (version 23.9.1.1854 (official build))

@gfunc gfunc Oct 14, 2023

Yes, I am using test_shard for the unit tests, and I added the env var in the GitHub workflow file.
The tests returned no errors. I will take a look at this BaseSnapshotCheckCols.

Contributor Author

I am not seeing any problem with this test.env file:

DBT_CH_TEST_USE_DOCKER=true
DBT_MACRO_DEBUGGING=true
DBT_CH_TEST_INCLUDE_S3=false
DBT_CH_TEST_CLUSTER=test_shard

and the command

pytest tests/integration/adapter/test_basic.py::TestSnapshotCheckCols

result attached:

============================= test session starts ==============================
platform linux -- Python 3.9.18, pytest-7.4.0, pluggy-1.2.0
rootdir: /home/xxx/projects/dbt-clickhouse
configfile: pytest.ini
plugins: dotenv-0.5.2
collected 1 item

tests/integration/adapter/test_basic.py .                                [100%]

============================== 1 passed in 19.58s ==============================

Please share your env file so I can reproduce.

Contributor

The reason is that I didn't have the test.env.

With the same env file as yours,

pytest tests/integration/adapter/test_basic.py -c pytest.ini -s

=> everything is OK after double-checking, test by test.

pytest tests/integration/ -c pytest.ini

=> 107 passed, 2 skipped, 2 warnings.

@zli06160 zli06160 left a comment

OK for me 👍


genzgd commented Oct 26, 2023

Thanks for all the work here, @gfunc and @zli06160. I'll do a 1.4.9 release shortly.

@genzgd genzgd merged commit 96474f1 into ClickHouse:main Oct 26, 2023
21 checks passed
Savid pushed a commit to Savid/dbt-clickhouse that referenced this pull request Jan 17, 2024
* added can_on_cluster var in ClickhouseRelation

* add tests for cluster

* fix lint issue

* debug set cluster env variable

* debug test

* debug and add tests

* skip distributed table grant test

* debug workflow

* debug workflow

* debug test

* add tests fro distributed_incremental

* fix zk path error

* fix wrong alias for distributed materializations

update aliase test

* update base on review
Solovechik added a commit to tekliner/dbt-clickhouse that referenced this pull request May 27, 2024