
Threshold argument for Test Severity #2219

Closed
sanjanasen96 opened this issue Mar 19, 2020 · 7 comments
Labels: dbt tests (Issues related to built-in dbt testing functionality)

Comments

@sanjanasen96

Describe the feature

  • A threshold argument for test severity.
  • Useful for more flexibility around test severity. E.g., if the test fails for fewer than 10 IDs, severity can be set to warn; if it fails for more than 10 IDs, severity can be set to error.

Describe alternatives you've considered

  • Currently we keep an eye on daily test failures and manually switch severity for individual tests as needed. This is neither scalable nor efficient.

Additional context

  • This is not database specific.

Who will this benefit?

  • We've found multiple use cases for this in relationship tests within multi-layer modeling (e.g., event modeling).
@jtcohen6 (Contributor) commented Mar 19, 2020

Worth saying off the bat:

  • Everything discussed in this issue and the comment below can be done today with custom schema tests that override the built-in unique, not_null, etc.
  • It would also be really cool if the built-in tests included some sane functionality that points in this direction

I think that becomes all the more relevant as we start to work with bigger datasets. We already run into issues with tiny failures for relationships tests because tables are loaded asynchronously, but it's also important for handling cases where, e.g., a billion-row table has 10 duplicate records.

One complication I'd throw out is absolute vs. relative thresholds. I may want a test to error if the number of failures is >1% of the overall row count, instead of being greater than an arbitrary number.

While not directly related, I also see this working in combination with a where (subset/partition) argument. E.g. I only want to test data from yesterday + today, and I want it to warn or error if there are <=1% or >1% nulls, respectively. Maybe that looks like:

models:
  - name: my_big_model
    columns:
      - name: important_column
        tests:
          - not_null:
              warn_after: {count: 0, percent: true}
              error_after: {count: 1, percent: true}
              where: "date_day >= (current_date() - interval 1 day)"
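
For illustration, here is a hand-written sketch of the SQL such a percent-aware not_null test might compile to. This is not dbt's actual compiled output; the schema qualifier my_schema is invented, and the where clause is reused from the YAML above:

-- Sketch only, not dbt's real compilation. Fails (returns a row) when more
-- than 1% of the filtered rows have a null important_column.
with failures as (
    select count(*) as n
    from my_schema.my_big_model
    where important_column is null
      and date_day >= (current_date() - interval 1 day)
),

total as (
    select count(*) as n
    from my_schema.my_big_model
    where date_day >= (current_date() - interval 1 day)
)

select failures.n as num_failures
from failures, total
where failures.n > total.n * 0.01  -- error_after: {count: 1, percent: true}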

@drewbanin (Contributor)

Check out #1173 - I think we could make a change like this in the scope of that issue (or shortly thereafter).

  • I like the idea of the where predicate here -- TBD whether that's built into dbt, or whether we require schema test authors to support it in their test definitions. I do imagine we could filter the referenced model in a subquery when interpolating the table reference, for example :) (a sketch follows this list)
  • I imagine we could calculate percentages pretty easily if we also run a count(*) on the specified model
  • I like warn_after and error_after more than I like severity: true -- it seems a lot more flexible!
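
A minimal sketch of that subquery idea from the first bullet, assuming a hypothetical interpolation step (none of these names are real dbt internals):

{# Hypothetical: instead of interpolating the bare table name, wrap the
   referenced model in a filtered subquery so the test only sees the subset. #}
{%- set model_relation -%}
    (select * from {{ ref('my_big_model') }}
     where date_day >= (current_date() - interval 1 day)) dbt_subquery
{%- endset -%}
{# Test templates would then interpolate model_relation wherever they
   would have used the plain table reference. #}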

@sean-rose

I'm also very much in favor of supporting relative thresholds, but for the syntax, instead of {count: <number>, percent: true} I'd suggest just having a number followed by a %; if there is no %, treat it as a row count.

- not_null:
    error_after: 10  # Row count

- not_null:
    error_after: 10%  # Relative percentage
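
A minimal sketch of how that parsing could work as a dbt macro (threshold_count_expr is a hypothetical name, not part of dbt):

{# Hypothetical helper: render a SQL expression for a threshold that is
   either an absolute count (10) or a relative percentage ("10%"). #}
{% macro threshold_count_expr(threshold, total_count_expr) %}
    {%- if threshold is string and threshold.endswith('%') -%}
        {#- "10%" renders as <total_count_expr> * 0.1 -#}
        {{ total_count_expr }} * {{ (threshold[:-1] | float) / 100 }}
    {%- else -%}
        {{ threshold }}
    {%- endif -%}
{% endmacro %}

With this, error_after: 10 renders as the literal 10, while error_after: "10%" renders as the supplied total-count expression multiplied by 0.1.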

@jtcohen6 added this to the Oh-Twenty [TBD] milestone Nov 13, 2020
@jtcohen6 added the "dbt tests" label Dec 31, 2020
@jtcohen6 (Contributor)

Let's do it! I think these will be new test configs, warn_if + error_if, with logic for detecting % as @sean-rose recommended.

Closing in favor of the concrete proposal in #3258
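
For reference, the warn_if / error_if configs took the form of count-based condition strings. A rough sketch with invented model and column names (see dbt's docs for the authoritative syntax):

# Illustrative sketch; my_model and id are invented names.
models:
  - name: my_model
    columns:
      - name: id
        tests:
          - unique:
              config:
                warn_if: ">10"    # warn when more than 10 failing rows
                error_if: ">100"  # error when more than 100 failing rows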

@jtcohen6 removed this from the Margaret Mead milestone Apr 13, 2021
@RoyalTS commented May 20, 2022

Sounds like the proposal in #3258 never ended up implementing relative thresholds?

@jtcohen6 (Contributor)

@RoyalTS We're working on this right now! Check out the discussion in #4723, and the WIP in #5172

@gracie2339

Hi all, I found a workaround for this by writing a generic test in dbt and then referencing it in the yml files like any other test:

This one implements a relationships test with a 1% error threshold and works on BigQuery. Feel free to change it as you see fit.

{% test relationships_with_1_pct_error_threshold(model, column_name, field, to) %}

with parent as (
    select {{ field }} as id
    from {{ to }}
),

child as (
    select {{ column_name }} as id
    from {{ model }}
),

-- Allow up to 1% of the parent's distinct IDs to be missing before failing.
error_threshold as (
    select cast(count(distinct id) * 0.01 as int) as num
    from parent
),

-- Child IDs with no matching parent ID.
failures as (
    select distinct id
    from child
    where id is not null
        and id not in (select id from parent)
)

-- Return failing IDs only when the failure count exceeds the threshold, so
-- the test passes (returns zero rows) while failures stay under the limit.
select id
from failures
where (select count(*) from failures) > (select num from error_threshold)

{% endtest %}
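
Once this test lives under macros/ (or tests/generic/), a schema.yml entry can reference it like any built-in test. A usage sketch, where orders, customers, and customer_id are invented names:

models:
  - name: orders          # invented example model
    columns:
      - name: customer_id  # invented example column
        tests:
          - relationships_with_1_pct_error_threshold:
              to: ref('customers')
              field: customer_id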
