Invalid setting when creating temporary tables with ON CLUSTER #366

Open
shachibista opened this issue Oct 10, 2024 · 5 comments
Labels: bug (Something isn't working)

Comments

shachibista commented Oct 10, 2024

Hi!

When the cluster property is set, creating a temporary table with Engine=Memory still passes the replicated_deduplication_window setting, which is invalid for that engine. It seems the code that adds replicated_deduplication_window only checks the materialization type and does not check the engine, which it should.

I think the update_model_settings method should accept the engine type and only allow settings that are available for that engine.
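
As a rough sketch of that idea (the macro and argument names here are invented for illustration and are not the adapter's real code), the SETTINGS clause could be rendered only from settings the target engine actually supports:

{# Hypothetical sketch: filter out settings the target engine does not support
   before rendering the SETTINGS clause. #}
{% macro render_model_settings(engine, settings) -%}
    {%- set supported = {} -%}
    {%- for name, value in settings.items() -%}
        {#- replicated_deduplication_window is a MergeTree-family setting;
            the Memory engine rejects it, as the error log below shows -#}
        {%- if 'MergeTree' in engine or not name.startswith('replicated_') -%}
            {%- do supported.update({name: value}) -%}
        {%- endif -%}
    {%- endfor -%}
    {%- if supported -%}
        SETTINGS {{ supported.items() | map('join', '=') | join(', ') }}
    {%- endif -%}
{%- endmacro %}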

I am seeing this issue when trying to run https://github.com/elementary-data/dbt-data-reliability, because it creates a temporary table when artifacts change.

Steps to reproduce

  1. Set up a ClickHouse multi-node cluster.
  2. Configure `cluster: "{cluster}"` in profiles.yml.
  3. Run the elementary bootstrap models: `dbt run -m elementary`.

Expected behaviour

No errors should occur. The temporary table should be created in memory on the connected node.
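
Concretely, the generated statement shown in the logs below would be accepted if the adapter simply omitted the MergeTree-only setting when the engine is Memory, i.e. roughly:

create temporary table `dbt_models__tmp_20241010152458544783`
    engine Memory
    as (
        SELECT *
        FROM `project_name`.`dbt_models`
        WHERE 1 = 0
    )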

Code examples, such as models or profile settings

# profiles.yml
project_name:
  target: local
  outputs:
    local:
      type: clickhouse
      driver: native
      schema: "{{ env_var('DBT_SCHEMA') }}"
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASS') }}"
      port: 9000
      cluster: "{cluster}"
      cluster_mode: true
      check_exchange: false
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.2.0
  - package: elementary-data/elementary
    version: 0.15.2
# dbt_project.yml
name: 'project_name'
version: '1.0.0'

profile: 'project_name'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
docs-paths: ["docs"]

clean-targets: 
  - "target"
  - "dbt_packages"

dispatch:
  - macro_namespace: elementary
    search_order: ['project_name', 'elementary']

models:
  +engine: ReplicatedMergeTree

  elementary:
    +engine: ReplicatedMergeTree
    edr:
      system:
        enabled: false
      data_monitoring:
        enabled: false # Doesn't yet work because stddev is not available and cannot be overridden.

seeds:
  +engine: ReplicatedMergeTree

flags:
  require_explicit_package_overrides_for_builtin_materializations: false
  source_freshness_run_project_hooks: true
# macros/current_timestamp.sql
{% macro clickhouse__edr_current_timestamp() -%}
    now(timezone())
{%- endmacro -%}

{% macro clickhouse__edr_current_timestamp_in_utc() -%}
    now()
{%- endmacro -%}

dbt and/or ClickHouse server logs

17:24:58.632265 [debug] [Thread-1 (]: dbt_clickhouse adapter: Error running SQL: /* {"app": "dbt", "dbt_version": "1.8.6", "profile_name": "opti_analytics", "target_name": "local", "node_id": "model.elementary.dbt_models"} */

    create temporary table `dbt_models__tmp_20241010152458544783`
        engine Memory
                    -- end_of_sql
                    SETTINGS replicated_deduplication_window=0
        as (
        SELECT
            *
        FROM `project_name`.`dbt_models`
        WHERE 1 = 0
        )

17:24:58.645419 [debug] [Thread-1 (]: Database Error in model dbt_models (models/edr/dbt_artifacts/dbt_models.sql)
  Code: 115.
  DB::Exception: Unknown setting 'replicated_deduplication_window': for storage Memory. Stack trace:
  
  0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000bc3f2c8
  1. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x0000000007a8dc7c
  2. DB::Exception::Exception<String>(int, FormatStringHelperImpl<std::type_identity<String>::type>, String&&) @ 0x0000000007a8d858
  3. DB::BaseSettingsHelpers::throwSettingNotFound(std::basic_string_view<char, std::char_traits<char>>) @ 0x000000000ea38e4c
  4. DB::BaseSettings<DB::memorySettingsTraits>::set(std::basic_string_view<char, std::char_traits<char>>, DB::Field const&) @ 0x000000000f1442c4
  5. DB::MemorySettings::loadFromQuery(DB::ASTStorage&) @ 0x000000000ff94e40
  6. std::shared_ptr<DB::IStorage> std::__function::__policy_invoker<std::shared_ptr<DB::IStorage> (DB::StorageFactory::Arguments const&)>::__call_impl<std::__function::__default_alloc_func<DB::registerStorageMemory(DB::StorageFactory&)::$_0, std::shared_ptr<DB::IStorage> (DB::StorageFactory::Arguments const&)>>(std::__function::__policy_storage const*, DB::StorageFactory::Arguments const&) (.llvm.1794116773610321499) @ 0x000000001008a0f8
  7. DB::StorageFactory::get(DB::ASTCreateQuery const&, String const&, std::shared_ptr<DB::Context>, std::shared_ptr<DB::Context>, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, DB::LoadingStrictnessLevel) const @ 0x000000000fffac7c
  8. std::shared_ptr<DB::IStorage> std::__function::__policy_invoker<std::shared_ptr<DB::IStorage> (DB::StorageID const&)>::__call_impl<std::__function::__default_alloc_func<DB::InterpreterCreateQuery::doCreateTable(DB::ASTCreateQuery&, DB::InterpreterCreateQuery::TableProperties const&, std::unique_ptr<DB::DDLGuard, std::default_delete<DB::DDLGuard>>&, DB::LoadingStrictnessLevel)::$_0, std::shared_ptr<DB::IStorage> (DB::StorageID const&)>>(std::__function::__policy_storage const*, DB::StorageID const&) @ 0x000000000f609ac4
  9. DB::TemporaryTableHolder::TemporaryTableHolder(std::shared_ptr<DB::Context const>, std::function<std::shared_ptr<DB::IStorage> (DB::StorageID const&)> const&, std::shared_ptr<DB::IAST> const&) @ 0x000000000f12ed98
  10. DB::InterpreterCreateQuery::doCreateTable(DB::ASTCreateQuery&, DB::InterpreterCreateQuery::TableProperties const&, std::unique_ptr<DB::DDLGuard, std::default_delete<DB::DDLGuard>>&, DB::LoadingStrictnessLevel) @ 0x000000000f6011b0
  11. DB::InterpreterCreateQuery::createTable(DB::ASTCreateQuery&) @ 0x000000000f5fbeec
  12. DB::InterpreterCreateQuery::execute() @ 0x000000000f607660
  13. DB::executeQueryImpl(char const*, char const*, std::shared_ptr<DB::Context>, DB::QueryFlags, DB::QueryProcessingStage::Enum, DB::ReadBuffer*) @ 0x000000000fbf0d58
  14. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, DB::QueryFlags, DB::QueryProcessingStage::Enum) @ 0x000000000fbed2a8
  15. DB::TCPHandler::runImpl() @ 0x0000000010a02904
  16. DB::TCPHandler::run() @ 0x0000000010a19348
  17. Poco::Net::TCPServerConnection::start() @ 0x0000000012d464b8
  18. Poco::Net::TCPServerDispatcher::run() @ 0x0000000012d46994
  19. Poco::PooledThread::run() @ 0x0000000012e3e718
  20. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000012e3cb10
  21. start_thread @ 0x0000000000007624
  22. ? @ 0x00000000000d162c

Configuration

Environment

  • dbt version: 1.8.6
  • dbt-clickhouse version: 1.8.7
  • clickhouse-driver version (if using native): 24.4.1.914
  • Python version: Python 3.11.10
  • Operating system: Running on clickhouse-server docker image.

ClickHouse server

  • ClickHouse Server version: 24.5.8
  • CREATE TABLE statements for tables involved: See above
shachibista added the bug label Oct 10, 2024

genzgd (Contributor) commented Oct 10, 2024

Is there any reason you can't just use a MergeTree table? Temporary tables have all kinds of gotchas, especially in a clustered environment, in ClickHouse Cloud, or if your server is behind a load balancer.

shachibista (Author) commented

> Is there any reason you can't just use a MergeTree table? Temporary tables have all kinds of gotchas, especially in a clustered environment, in ClickHouse Cloud, or if your server is behind a load balancer.

We have a multi-node cluster, so I would like the uploaded results to be accessible on all nodes; otherwise a specific node must be configured for all clients. These tables are created as an intermediate step by elementary.

Besides, even using MergeTree does not solve the actual issue, which is a bug. The temporary tables are created by the elementary package, not explicitly by my dbt project. Currently it is the most compatible observability toolkit for dbt, since dbt_artifacts does not support ClickHouse; the alternative is to create our own package, which I would like to avoid.

genzgd (Contributor) commented Oct 10, 2024

That sounds like an incompatibility in the elementary package. The data in a temporary (in memory) table on one node will not be available on other nodes in the cluster. So anything you do in a temporary table will be by definition "not accessible on all nodes", and if that package requires temporary tables it's extremely unlikely it will work on a clustered ClickHouse installation.

shachibista (Author) commented

> That sounds like an incompatibility in the elementary package.

I am not contesting this, but it seems like a red herring in this context. Either clickhouse__create_table_as should not allow the temporary flag at all, or it should parse the settings and pass them correctly (which is the bug described in this issue).
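
For the second option, a guard along these lines inside the macro would already avoid the error (sketch only; I am assuming the usual create_table_as(temporary, relation, sql) signature, which may not match the real implementation):

{#- Sketch only: temporary tables are created with the Memory engine,
    so emit no MergeTree/replication settings for them. -#}
{%- if temporary -%}
    {#- no SETTINGS clause for Memory tables -#}
{%- else -%}
    SETTINGS replicated_deduplication_window=0
{%- endif -%}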

> The data in a temporary (in memory) table on one node will not be available on other nodes in the cluster. So anything you do in a temporary table will be by definition "not accessible on all nodes", and if that package requires temporary tables it's extremely unlikely it will work on a clustered ClickHouse installation.

The final tables are clustered; only the intermediate results are stored in a (supposedly) temporary table. The flow is that run_results.json is uploaded to a node (any node), transformed on that same node (supposedly in a temporary table), and the results are written to a clustered table. Since the artifacts are small and fine to keep local, it does not matter which node runs the transformation; what matters is where the final results end up, which in this case is a ReplicatedMergeTree.

It would be really great if this could be resolved, since it is seemingly the only issue preventing some (limited) observability on our dbt stack using elementary and ClickHouse.

genzgd (Contributor) commented Oct 11, 2024

Fair point on the temporary table code. That code long predates the cluster functionality and is a likely source of future bugs when dbt either runs in cluster mode or hits a different back-end ClickHouse node due to load balancing, etc. Nevertheless, you might want to open an issue in the elementary project to allow a mode that doesn't rely on temporary tables (and just cleans up intermediate tables, like dbt-clickhouse does for incremental models).

You are correct, it's technically a bug. It might get fixed faster if you submit a PR. :)
