Merge branch 'current' into dbeatty10-patch-2
dbeatty10 authored Oct 12, 2024
2 parents 13eec36 + 8d36029 commit f6a9383
Showing 5 changed files with 56 additions and 22 deletions.
23 changes: 16 additions & 7 deletions website/docs/docs/build/incremental-microbatch.md
@@ -24,7 +24,7 @@ Each "batch" corresponds to a single bounded time period (by default, a single d

### Example

A `sessions` model is aggregating and enriching data that comes from two other models:
A `sessions` model aggregates and enriches data that comes from two other models.
- `page_views` is a large, time-series table. It contains many rows, new records almost always arrive after existing ones, and existing records rarely update.
- `customers` is a relatively small dimensional table. Customer attributes update often, and not in a time-based manner — that is, older customers are just as likely to change column values as newer customers.

@@ -39,12 +39,15 @@ models:
event_time: page_view_start
```
</File>
We run the `sessions` model on October 1, 2024, and then again on October 2. It produces the following queries:

<Tabs>

<TabItem value="Model definition">

The `event_time` for the `sessions` model is set to `session_start`, which marks the beginning of a user’s session on the website. This setting allows dbt to combine multiple page views (each tracked by its own `page_view_start` timestamp) into a single session. This way, `session_start` differentiates the timing of individual page views from the broader timeframe of the entire user session.

<File name="models/sessions.sql">

```sql
@@ -70,7 +73,13 @@ customers as (
),
...
select
    page_views.id as session_id,
    page_views.page_view_start as session_start,
    customers.*
from page_views
left join customers
    on page_views.customer_id = customers.id
```

</File>
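For reference, a minimal properties sketch that sets this `event_time` on the `sessions` model, mirroring the `page_views` example above, might look like the following (the file name and YAML layout here are assumed, not taken from this page):

<File name="models/sessions.yml">

```yaml
models:
  - name: sessions
    config:
      event_time: session_start
```

</File>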
@@ -141,7 +150,7 @@ customers as (

dbt will instruct the data platform to take the result of each batch query and insert, update, or replace the contents of the `analytics.sessions` table for the same day of data. To perform this operation, dbt will use the most efficient atomic mechanism for "full batch" replacement that is available on each data platform.

It does not matter whether the table already contains data for that day, or not. Given the same input data, no matter how many times a batch is reprocessed, the resulting table is the same.
It does not matter whether the table already contains data for that day. Given the same input data, the resulting table is the same no matter how many times a batch is reprocessed.
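As a rough illustration, replacing one day's batch can be thought of as a delete-plus-insert scoped to that day. The SQL below is only a sketch: the actual mechanism is chosen per platform by the adapter, and the `analytics` locations of the input tables are assumed.

```sql
-- Illustrative only; dbt picks the most efficient atomic mechanism available on each platform.
delete from analytics.sessions
where session_start >= '2024-10-01' and session_start < '2024-10-02';

insert into analytics.sessions
select
    page_views.id as session_id,
    page_views.page_view_start as session_start,
    customers.*
from analytics.page_views
left join analytics.customers
    on page_views.customer_id = customers.id
where page_views.page_view_start >= '2024-10-01'
  and page_views.page_view_start < '2024-10-02';
```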

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_filters.png" title="Each batch of sessions filters page_views to the matching time-bound batch, but doesn't filter sessions, performing a full scan for each batch."/>

@@ -175,11 +184,11 @@ During standard incremental runs, dbt will process batches according to the curr

<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_lookback.png" title="Configure a lookback to reprocess additional batches during standard incremental runs"/>

**Note:** If there’s an upstream model that configures `event_time`, but you *don’t* want the reference to it to be filtered, you can specify `ref('upstream_model').render()` to opt-out of auto-filtering. This isn't generally recommended — most models which configure `event_time` are fairly large, and if the reference is not filtered, each batch will perform a full scan of this input table.
**Note:** If there’s an upstream model that configures `event_time`, but you *don’t* want the reference to it to be filtered, you can specify `ref('upstream_model').render()` to opt-out of auto-filtering. This isn't generally recommended — most models that configure `event_time` are fairly large, and if the reference is not filtered, each batch will perform a full scan of this input table.
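In a model file, that opt-out might look like this (the upstream model name mirrors the note above):

```sql
-- This reference is NOT auto-filtered, so each batch performs a full scan of upstream_model
select * from {{ ref('upstream_model').render() }}
```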

### Backfills

Whether to fix erroneous source data, or retroactively apply a change in business logic, you may need to reprocess a large amount of historical data.
Whether to fix erroneous source data or retroactively apply a change in business logic, you may need to reprocess a large amount of historical data.

Backfilling a microbatch model is as simple as selecting it to run or build, and specifying a "start" and "end" for `event_time`. As always, dbt will process the batches between the start and end as independent queries.
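For example, backfilling the `sessions` model for a few days in September might look like this (the dates are illustrative, and the flags are described just below):

```bash
dbt run --select sessions --event-time-start "2024-09-01" --event-time-end "2024-09-04"
```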

@@ -204,7 +213,7 @@ For now, dbt assumes that all values supplied are in UTC:
- `--event-time-start`
- `--event-time-end`

While we may consider adding support for custom timezones in the future, we also believe that defining these values in UTC makes everyone's lives easier.
While we may consider adding support for custom time zones in the future, we also believe that defining these values in UTC makes everyone's lives easier.

## How does `microbatch` compare to other incremental strategies?

@@ -261,7 +270,7 @@ select * from {{ ref('stg_events') }} -- this ref will be auto-filtered

</File>

Where you’ve also set an `event_time` for the model’s direct parents - in this case `stg_events`:
Where you’ve also set an `event_time` for the model’s direct parents - in this case, `stg_events`:

<File name="models/staging/stg_events.yml">

4 changes: 2 additions & 2 deletions website/docs/docs/collaborate/govern/model-contracts.md
@@ -178,14 +178,14 @@ Currently, `not_null` and `check` constraints are enforced only after a model is
### Which models should have contracts?

Any model meeting the criteria described above _can_ define a contract. We recommend defining contracts for ["public" models](model-access) that are being relied on downstream.
- Inside of dbt: Shared with other groups, other teams, and (in the future) other dbt projects.
- Inside of dbt: Shared with other groups, other teams, and [other dbt projects](/best-practices/how-we-mesh/mesh-1-intro).
- Outside of dbt: Reports, dashboards, or other systems & processes that expect this model to have a predictable structure. You might reflect these downstream uses with [exposures](/docs/build/exposures).

### How are contracts different from tests?

A model's contract defines the **shape** of the returned dataset. If the model's logic or input data doesn't conform to that shape, the model does not build.

[Data Tests](/docs/build/data-tests) are a more flexible mechanism for validating the content of your model _after_ it's built. So long as you can write the query, you can run the data test. Data tests are more configurable, such as with [custom severity thresholds](/reference/resource-configs/severity). They are easier to debug after finding failures, because you can query the already-built model, or [store the failing records in the data warehouse](/reference/resource-configs/store_failures).
[Data Tests](/docs/build/data-tests) are a more flexible mechanism for validating the content of your model _after_ it's built. So long as you can write the query, you can run the data test. Data tests are more configurable, such as with [custom severity thresholds](/reference/resource-configs/severity). They are easier to debug after finding failures because you can query the already-built model, or [store the failing records in the data warehouse](/reference/resource-configs/store_failures).
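As a small illustrative sketch (the model and column names are assumed, not from this page): the contract declares the shape that must hold when the model builds, while the data test checks the same condition only after the model is built.

```yaml
models:
  - name: dim_customers
    config:
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null   # checked by the platform at build time
        data_tests:
          - not_null         # checked by dbt after the model is built
```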

In some cases, you can replace a data test with its equivalent constraint. This has the advantage of guaranteeing the validation at build time, and it probably requires less compute (cost) in your data platform. The prerequisites for replacing a data test with a constraint are:
- Making sure that your data platform can support and enforce the constraint that you need. Most platforms only enforce `not_null`.
25 changes: 25 additions & 0 deletions website/docs/docs/dbt-versions/release-notes.md
@@ -20,6 +20,31 @@ Release notes are grouped by month for both multi-tenant and virtual private clo

## October 2024

<Expandable alt_header="Coalesce 2024 announcements">

Documentation for new features and functionality announced at Coalesce 2024:

- Iceberg table support for [Snowflake](https://docs.getdbt.com/reference/resource-configs/snowflake-configs#iceberg-table-format)
- [Athena](https://docs.getdbt.com/reference/resource-configs/athena-configs) and [Teradata](https://docs.getdbt.com/reference/resource-configs/teradata-configs) adapter support in dbt Cloud
- dbt Cloud now hosted on [Azure](https://docs.getdbt.com/docs/cloud/about-cloud/access-regions-ip-addresses)
- Get comfortable with [Versionless dbt Cloud](https://docs.getdbt.com/docs/dbt-versions/versionless-cloud)
- Scalable [microbatch incremental models](https://docs.getdbt.com/docs/build/incremental-microbatch)
- Advanced CI [features](https://docs.getdbt.com/docs/deploy/advanced-ci)
- [Linting with CI jobs](https://docs.getdbt.com/docs/deploy/continuous-integration#sql-linting)
- dbt Assist is now [dbt Copilot](https://docs.getdbt.com/docs/cloud/dbt-copilot)
- Developer blog on [Snowflake Feature Store and dbt: A bridge between data pipelines and ML](https://docs.getdbt.com/blog/snowflake-feature-store)
- New [Quickstart for dbt Cloud CLI](https://docs.getdbt.com/guides/dbt-cloud-cli?step=1)
- [Auto-exposures with Tableau](https://docs.getdbt.com/docs/collaborate/auto-exposures)
- Semantic Layer integration with [Excel desktop and M365](https://docs.getdbt.com/docs/cloud-integrations/semantic-layer/excel)
- [Data health tiles](https://docs.getdbt.com/docs/collaborate/data-tile)
- [Semantic Layer and Cloud IDE integration](https://docs.getdbt.com/docs/build/metricflow-commands#metricflow-commands)
- Query history in [Explorer](https://docs.getdbt.com/docs/collaborate/model-query-history#view-query-history-in-explorer)
- Semantic Layer MetricFlow improvements, including [improved granularity and custom calendar](https://docs.getdbt.com/docs/build/metricflow-time-spine#custom-calendar)
- [Python SDK](https://docs.getdbt.com/docs/dbt-cloud-apis/sl-python) is now generally available

</Expandable>


- **New**: The [dbt Semantic Layer Python software development kit](/docs/dbt-cloud-apis/sl-python) is now [generally available](/docs/dbt-versions/product-lifecycles). It provides users with easy access to the dbt Semantic Layer with Python and enables developers to interact with the dbt Semantic Layer APIs to query metrics/dimensions in downstream tools.
- **Enhancement**: You can now add a description to a singular data test in dbt Cloud Versionless. Use the [`description` property](/reference/resource-properties/description) to document [singular data tests](/docs/build/data-tests#singular-data-tests). You can also use a [docs block](/docs/build/documentation#using-docs-blocks) to capture your test description. The enhancement will be included in the upcoming dbt Core 1.9 release.
- **New**: Introducing the [microbatch incremental model strategy](/docs/build/incremental-microbatch) (beta), available in dbt Cloud Versionless and soon to be supported in dbt Core 1.9. The microbatch strategy allows for efficient, batch-based processing of large time-series datasets for improved performance and resiliency, especially when you're working with data that changes over time (like new records being added daily). To enable this feature in dbt Cloud, set the `DBT_EXPERIMENTAL_MICROBATCH` environment variable to `true` in your project.
2 changes: 1 addition & 1 deletion website/docs/reference/node-selection/defer.md
@@ -31,7 +31,7 @@ dbt test --models [...] --defer --state path/to/artifacts

When the `--defer` flag is provided, dbt will resolve `ref` calls differently depending on two criteria:
1. Is the referenced node included in the model selection criteria of the current run?
2. Does the reference node exist as a database object in the current environment?
2. Does the referenced node exist as a database object in the current environment?

If the answer to both is **no**—a node is not included _and_ it does not exist as a database object in the current environment—references to it will use the other namespace instead, provided by the state manifest.
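For illustration (the schema and model names here are assumed): suppose `model_b` refs `model_a`, only `model_b` is selected, and `model_a` does not exist in the development schema.

```sql
-- models/model_b.sql
select * from {{ ref('model_a') }}

-- Compiled without --defer (or when model_a exists in the current environment):
--   select * from dev_schema.model_a
-- Compiled with --defer, when model_a is unselected and absent from dev:
--   select * from prod_schema.model_a   -- namespace resolved from the state manifest
```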

24 changes: 12 additions & 12 deletions website/docs/reference/resource-configs/firebolt-configs.md
@@ -38,8 +38,8 @@ models:
+table_type: fact
+primary_index: [ <column-name>, ... ]
+indexes:
- type: aggregating
key_column: [ <column-name>, ... ]
- index_type: aggregating
key_columns: [ <column-name>, ... ]
aggregation: [ <agg-sql>, ... ]
...
```
@@ -58,8 +58,8 @@ models:
table_type: fact
primary_index: [ <column-name>, ... ]
indexes:
- type: aggregating
key_column: [ <column-name>, ... ]
- index_type: aggregating
key_columns: [ <column-name>, ... ]
aggregation: [ <agg-sql>, ... ]
...
```
@@ -77,9 +77,9 @@ models:
primary_index = [ "<column-name>", ... ],
indexes = [
{
type = "aggregating"
key_column = [ "<column-name>", ... ],
aggregation = [ "<agg-sql>", ... ],
"index_type": "aggregating"
"key_columns": [ "<column-name>", ... ],
"aggregation": [ "<agg-sql>", ... ],
},
...
]
@@ -99,8 +99,8 @@ models:
| `table_type` | Whether the materialized table will be a [fact or dimension](https://docs.firebolt.io/godocs/Overview/working-with-tables/working-with-tables.html#fact-and-dimension-tables) table. |
| `primary_index` | Sets the primary index for the fact table using the inputted list of column names from the model. Required for fact tables. |
| `indexes` | A list of aggregating indexes to create on the fact table. |
| `type` | Specifies that the index is an [aggregating index](https://docs.firebolt.io/godocs/Guides/working-with-indexes/using-aggregating-indexes.html). Should be set to `aggregating`. |
| `key_column` | Sets the grouping of the aggregating index using the inputted list of column names from the model. |
| `index_type` | Specifies that the index is an [aggregating index](https://docs.firebolt.io/godocs/Guides/working-with-indexes/using-aggregating-indexes.html). Should be set to `aggregating`. |
| `key_columns` | Sets the grouping of the aggregating index using the inputted list of column names from the model. |
| `aggregation` | Sets the aggregations on the aggregating index using the inputted list of SQL agg expressions. |


@@ -113,9 +113,9 @@ models:
primary_index = "id",
indexes = [
{
type: "aggregating",
key_column: "order_id",
aggregation: ["COUNT(DISTINCT status)", "AVG(customer_id)"]
"index_type": "aggregating",
"key_columns": "order_id",
"aggregation": ["COUNT(DISTINCT status)", "AVG(customer_id)"]
}
]
) }}
