Merge branch 'current' into mirnawong1-patch-5

dbt-labs · Jul 20, 2023 · 1a3ac1d
2 parents f770ae9 + 19cc1c9
commit 1a3ac1d

Showing 17 changed files with 250 additions and 14 deletions.
diff --git a/.github/workflows/autoupdate.yml b/.github/workflows/autoupdate.yml
@@ -2,11 +2,11 @@ name: Auto Update
 
 on:
   # This will trigger on all pushes to all branches.
-  push: {}
+#  push: {}
   # Alternatively, you can only trigger if commits are pushed to certain branches, e.g.:
-  # push:
-  #   branches:
-  #     - current
+  push:
+    branches:
+      - current
   #     - unstable
 jobs:
   autoupdate:

diff --git a/website/blog/2023-07-17-GPT-and-dbt-test.md b/website/blog/2023-07-17-GPT-and-dbt-test.md
@@ -0,0 +1,213 @@
+---
+title: "Create dbt Documentation and Tests 10x faster with ChatGPT"
+description: "You can use ChatGPT to infer the context of verbosely named fields from database table schemas."
+slug: create-dbt-documentation-10x-faster-with-ChatGPT
+
+authors: [pedro_brito_de_sa]
+
+tags: [analytics craft, data ecosystem]
+hide_table_of_contents: true
+
+date: 2023-07-18
+is_featured: true
+---
+
+Whether you are creating your pipelines into dbt for the first time or just adding a new model once in a while, **good documentation and testing should always be a priority** for you and your team. Why do we avoid it like the plague then? Because it’s a hassle having to write down each individual field, its description in layman terms and figure out what tests should be performed to ensure the data is fine and dandy. How can we make this process faster and less painful?
+
+By now, everyone knows the wonders of the GPT models for code generation and pair programming so this shouldn’t come as a surprise. But **ChatGPT really shines** at inferring the context of verbosely named fields from database table schemas. So in this post I am going to help you 10x your documentation and testing speed by using ChatGPT to do most of the leg work for you.
+
+<!--truncate-->
+
+As a one-person Analytics team at [Sage](http://www.hellosage.com/) I had to create our dbt pipelines from the ground up. This meant 30+ tables of internal facts and dimensions + external data into a Staging Layer, plus all of the following layers of augmented models and Mart tables. After the fact, we are talking about 3500+ lines of YAML that I was NOT excited to get started on. Fortunately for me, this was February 2023 and ChatGPT had just come out. And boy, was I glad to have it. After a good dose of “prompt engineering” I managed to get most of my documentation and tests written out, only needing a few extra tweaks.
+
+As I write this article in July 2023, ChatGPT is powered by GPT-4 rather than GPT-3.5, so it is already easier to get the same results I did. Here are my learnings, which I hope everyone can replicate.
+
+## Use verbose tables with verbose fields
+
+ChatGPT can only infer so much, so tables with names and fields that resemble encryption keys are unlikely to be good for this approach. In this example we are going to use this table:
+
+```sql
+create or replace TRANSIENT TABLE STAGING.BASE.STG_STAFF_MEMBER (
+      ID NUMBER(38,0),
+      CREATEDATETIME TIMESTAMP_NTZ(9),
+      UPDATEDATETIME TIMESTAMP_NTZ(9),
+      VERSION NUMBER(38,0),
+      FIRSTNAME VARCHAR(16777216),
+      JOBTITLE VARCHAR(16777216),
+      LASTNAME VARCHAR(16777216),
+      MIDDLENAME VARCHAR(16777216),
+      ISCAREADMIN BOOLEAN,
+      ISARCHIVED BOOLEAN,
+      ADDRESSID VARCHAR(16777216),
+      ENTERPRISEID VARCHAR(16777216),
+      ISDELETED BOOLEAN
+);
+
+```
+
+---
+
+I copied this definition from Snowflake, and I would recommend that you always include both the name and type of the fields to get better results. This is especially helpful if you want to make sure ChatGPT correctly distinguishes between text, number, boolean or array fields, which may not be clear from the name alone.
+
+## Start out with simple prompts
+
+We all know how ChatGPT can digest very complex prompts, but as this is a tool for you to get the best results for your use case, I would recommend starting with simple, but specific, prompts.
+
+Opening ChatGPT with GPT4, my first prompt is usually along these lines:
+
+<Lightbox src="/img/blog/2023-07-17-GPT-and-dbt-test/image1.png" title="ChatGPT request for YAML file" />
+
+And the output of this simple prompt is nothing short of amazing:
+
+```yaml
+version: 2
+
+models:
+  - name: STAFF_MEMBER
+    description: This table contains information about the staff members.
+    columns:
+      - name: ID
+        description: The unique identifier for the staff member.
+        tests:
+          - unique
+          - not_null
+      - name: CREATEDATETIME
+        description: The timestamp when the record was created.
+        tests:
+          - not_null
+      - name: UPDATEDATETIME
+        description: The timestamp when the record was last updated.
+      - name: VERSION
+        description: Version number of the record.
+      - name: FIRSTNAME
+        description: The first name of the staff member.
+        tests:
+          - not_null
+      - name: JOBTITLE
+        description: The job title of the staff member.
+        tests:
+          - not_null
+      - name: LASTNAME
+        description: The last name of the staff member.
+        tests:
+          - not_null
+      - name: MIDDLENAME
+        description: The middle name of the staff member.
+      - name: ISCARADMIN
+        description: Boolean value indicating if the staff member is a care administrator.
+        tests:
+          - accepted_values:
+              values: ['true', 'false']
+      - name: ISARCHIVED
+        description: Boolean value indicating if the staff member record is archived.
+        tests:
+          - accepted_values:
+              values: ['true', 'false']
+      - name: COMMUNITYID
+        description: Identifier for the community of the staff member.
+      - name: ENTERPRISEID
+        description: Identifier for the enterprise of the staff member.
+      - name: ISDELETED
+        description: Boolean value indicating if the staff member record is deleted.
+        tests:
+          - accepted_values:
+              values: ['true', 'false']
+```
+
+Back in my day (5 months ago), ChatGPT with GPT 3.5 didn’t have much context on what a dbt-style YAML file was, so the result was a funky, non-dbt YAML format. If you are using GPT 3.5 and the results you get from the first prompt don’t compile in dbt, I would recommend an extra step I took which is to pass your understanding of what a dbt-style YAML file is. You can do this simply by copy-pasting a piece of a publicly available YAML file like [this one from the Jaffle Shop example](https://github.com/dbt-labs/jaffle_shop/blob/main/models/schema.yml), or a piece of YAML from your own code base.
+
+A prompt for it would look something like:
+
+<Lightbox src="/img/blog/2023-07-17-GPT-and-dbt-test/image2.png" title="ChatGPT request to memorize dbt-style YAML" />
+
+## Specify details on generic tests in your prompts
+
+dbt has four built-in generic tests: unique, not_null, accepted_values and relationships. My approach to adding details on these to the prompts is one of ‘more is better’, as in it’s faster to just delete tests that I don’t need in certain fields than having to add them manually.
+
+ChatGPT will add unique and not_null tests to almost all columns that have ‘id’ in their name, and in my experience it adds them to a few more columns than it should, which is usually quick to fix.
+
+Accepted_values and relationships are slightly trickier but the model can be adjusted to account for these in fairly simple ways. My tips are:
+
+- accepted_values: add extra detail to the schema, such as a keyword like ‘CATEGORICAL’ after the data type
+- relationships: bind the relationships to fields with specific names. For better results, pass on a list of valid table names from your schema that it can refer to.
+
+One way of doing this is with a prompt like this:
+
+<Lightbox src="/img/blog/2023-07-17-GPT-and-dbt-test/image3.png" title="ChatGPT request to write dbt-style YAML with generic tests" />
+
+Which results in the following output:
+
+```yaml
+version: 2
+
+models:
+  - name: STAFF_MEMBER
+    description: This table contains information about the staff members.
+    columns:
+      - name: ID
+        description: The unique identifier for the staff member.
+        tests:
+          - unique
+          - not_null
+      - name: CREATEDATETIME
+        description: The timestamp when the record was created.
+        tests:
+          - not_null
+      - name: UPDATEDATETIME
+        description: The timestamp when the record was last updated.
+      - name: VERSION
+        description: Version number of the record.
+      - name: FIRSTNAME
+        description: The first name of the staff member.
+        tests:
+          - not_null
+      - name: JOBTITLE
+        description: The job title of the staff member. This is a categorical field.
+        tests:
+          - not_null
+          - accepted_values:
+              values: ['Job Title 1', 'Job Title 2', 'Job Title 3'] # replace these with actual job titles
+      - name: LASTNAME
+        description: The last name of the staff member.
+        tests:
+          - not_null
+      - name: MIDDLENAME
+        description: The middle name of the staff member.
+      - name: ISCARADMIN
+        description: Boolean value indicating if the staff member is a care administrator.
+        tests:
+          - accepted_values:
+              values: ['true', 'false']
+      - name: ISARCHIVED
+        description: Boolean value indicating if the staff member record is archived.
+        tests:
+          - accepted_values:
+              values: ['true', 'false']
+      - name: COMMUNITYID
+        description: Identifier for the community of the staff member.
+        tests:
+          - relationships:
+              to: STAGING.COMMUNITY.ID # replace with actual reference table
+              field: ID
+      - name: ENTERPRISEID
+        description: Identifier for the enterprise of the staff member.
+        tests:
+          - relationships:
+              to: STAGING.ENTERPRISE.ID # replace with actual reference table
+              field: ID
+      - name: ISDELETED
+        description: Boolean value indicating if the staff member record is deleted.
+        tests:
+          - accepted_values:
+              values: ['true', 'false']
+```
+
+Even though it doesn’t magically guess the categories of your variables or your relationships, it is definitely helpful to have some placeholders in the right places.
+
+As an add-on, giving the model a short description of the data models and the tables you are working with will help it fine tune your results.
+
+## Wrap-Up
+
+Creating documentation is still a very manual job, and this approach only works for one table at a time (maybe you can be the one leveraging the OpenAI API and creating a webapp that processes multiple tables at once?). However, ChatGPT can clearly cut a lot of time in these tasks.
+
+I hope that these simple tips help you be more motivated and efficient in creating documentation and tests for your data models. And remember: verbosity in - verbosity out!
diff --git a/website/blog/authors.yml b/website/blog/authors.yml
@@ -373,6 +373,16 @@ pat_kearns:
   name: Pat Kearns
   organization: dbt Labs
 
+pedro_brito_de_sa:
+  image_url: /img/blog/authors/pedro_brito.jpeg
+  job_title: Product Analyst
+  links:
+    - icon: fa-linkedin
+      url: https://www.linkedin.com/in/pbritosa/
+  name: Pedro Brito de Sa
+  organization: Sage
+
+
 rastislav_zdechovan:
   image_url: /img/blog/authors/rastislav-zdechovan.png
   job_title: Analytics Engineer

diff --git a/...2021-11-23-on-the-importance-of-naming.md → website/blog/src.md b/...2021-11-23-on-the-importance-of-naming.md → website/blog/src.md
diff --git a/website/docs/docs/build/about-metricflow.md b/website/docs/docs/build/about-metricflow.md
@@ -60,6 +60,7 @@ Metrics, which is a key concept, are functions that combine measures, constraint
 
 MetricFlow supports different metric types:
 
+- [Cumulative](/docs/build/cumulative) &mdash;  Aggregates a measure over a given window.
 - [Derived](/docs/build/derived) &mdash; An expression of other metrics, which allows you to do calculations on top of metrics.
 - [Ratio](/docs/build/ratio) &mdash; Create a ratio out of two measures, like revenue per customer.
 - [Simple](/docs/build/simple) &mdash; Metrics that refer directly to one measure. 

diff --git a/website/docs/docs/build/cumulative-metrics.md b/website/docs/docs/build/cumulative-metrics.md
@@ -8,6 +8,12 @@ tags: [Metrics, Semantic Layer]
 
 Cumulative metrics aggregate a measure over a given window. If no window is specified, the window is considered infinite and accumulates values over all time.
 
+:::info MetricFlow time spine required
+
+You will need to create the [time spine model](/docs/build/metricflow-time-spine) before you add cumulative metrics.
+
+:::
+
 ```yaml
 # Cumulative metrics aggregate a measure over a given window. The window is considered infinite if no window parameter is passed (accumulate the measure over all time)
 metrics:
@@ -24,7 +30,7 @@ metrics:
 
 ### Window options
 
-This section details examples for when you specify and don't specify window options.
+This section details examples of when you specify and don't specify window options.
 
 <Tabs>
 
@@ -56,7 +62,7 @@ metrics:
     window: 7 days
 ```
 
-From the sample yaml above, note the following: 
+From the sample YAML above, note the following: 
 
 * `type`: Specify cumulative to indicate the type of metric. 
 * `type_params`: Specify the measure you want to aggregate as a cumulative metric. You have the option of specifying a `window`, or a `grain to date`.  
@@ -142,7 +148,7 @@ metrics:
 ```yaml
 metrics: 
   name: revenue_monthly_grain_to_date #For this metric, we use a monthly grain to date 
-  description: Monthly revenue using a grain to date of 1 month (think of this as a monthly resetting point) 
+  description: Monthly revenue using grain to date of 1 month (think of this as a monthly resetting point) 
   type: cumulative 
   type_params: 
     measures: 

diff --git a/website/docs/docs/build/incremental-models.md b/website/docs/docs/build/incremental-models.md
@@ -57,6 +57,7 @@ from raw_app_data.events
 {% if is_incremental() %}
 
   -- this filter will only be applied on an incremental run
+  -- (uses > to include records whose timestamp occurred since the last run of this model)
   where event_time > (select max(event_time) from {{ this }})
 
 {% endif %}
@@ -137,6 +138,7 @@ from raw_app_data.events
 {% if is_incremental() %}
 
   -- this filter will only be applied on an incremental run
+  -- (uses >= to include records arriving later on the same day as the last run of this model)
   where date_day >= (select max(date_day) from {{ this }})
 
 {% endif %}

diff --git a/website/docs/docs/build/metrics-overview.md b/website/docs/docs/build/metrics-overview.md
@@ -25,10 +25,10 @@ This page explains the different supported metric types you can add to your dbt
 - [Ratio](#ratio-metrics) — Create a ratio out of two measures. 
 -->
 
-<!--not supported for this release
+
 ### Cumulative metrics 
 
-[Cumulative metrics](/docs/build/cumulative) aggregate a measure over a given window. Note that if no window is specified, the window would accumulate the measure over all time. 
+[Cumulative metrics](/docs/build/cumulative) aggregate a measure over a given window. If no window is specified, the window would accumulate the measure over all time. **Note**: You will need to create the [time spine model](/docs/build/metricflow-time-spine) before you add cumulative metrics.
 
 ```yaml
 # Cumulative metrics aggregate a measure over a given window. The window is considered infinite if no window parameter is passed (accumulate the measure over all time)
@@ -43,7 +43,6 @@ metrics:
     #Omitting window will accumulate the measure over all time
     window: 7 days
 ```
--->
 ### Derived metrics
 
 [Derived metrics](/docs/build/derived) are defined as an expression of other metrics. Derived metrics allow you to do calculations on top of metrics. 
@@ -145,7 +144,9 @@ You can set more metadata for your metrics, which can be used by other tools lat
 ## Related docs
 
 - [Semantic models](/docs/build/semantic-models)
+- [Cumulative](/docs/build/cumulative)
 - [Derived](/docs/build/derived)
 
 
 
+
diff --git a/website/docs/faqs/Docs/modify-owner-column.md b/website/docs/faqs/Docs/modify-owner-column.md
@@ -8,7 +8,7 @@ id: modify-owner-column
 
 Due to the nature of the field, you won't be able to change the owner column in your generated documentation. 
 
-The _owner_ field in `dbt-docs` is pulled from database metdata (`catalog.json`), meaning the owner of that table in the database. With the exception of exposures, it's not pulled from an `owner` field set within dbt.
+The _owner_ field in `dbt-docs` is pulled from database metadata (`catalog.json`), meaning the owner of that table in the database. With the exception of exposures, it's not pulled from an `owner` field set within dbt.
 
 Generally, dbt's database user owns the tables created in the database. Source tables are usually owned by the service responsible for ingesting/loading them. 
 

diff --git a/website/docs/guides/legacy/debugging-schema-names.md b/website/docs/guides/legacy/debugging-schema-names.md
@@ -16,7 +16,7 @@ You can also follow along via this video:
 Do a file search to check if you have a macro named `generate_schema_name` in the `macros` directory of your project.
 
 #### I do not have a macro named `generate_schema_name` in my project
-This means that you are using dbt's default implementation of the macro, as defined [here](https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/include/global_project/macros/get_custom_name/get_custom_schema.sql#L17-L30)
+This means that you are using dbt's default implementation of the macro, as defined [here](https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/include/global_project/macros/get_custom_name/get_custom_schema.sql#L47C1-L60)
 
 ```sql
 {% macro generate_schema_name(custom_schema_name, node) -%}

diff --git a/website/docs/reference/commands/clone.md b/website/docs/reference/commands/clone.md
@@ -23,7 +23,7 @@ dbt clone --state path/to/artifacts
 # clone one_specific_model of my models from specified state to my target schema(s)
 dbt clone --select one_specific_model --state path/to/artifacts
 
-# clone all of my models from specified state to my target schema(s) and recreate all pre-exisiting relations in the current target
+# clone all of my models from specified state to my target schema(s) and recreate all pre-existing relations in the current target
 dbt clone --state path/to/artifacts --full-refresh
 
 # clone all of my models from specified state to my target schema(s), running up to 50 clone statements in parallel
@@ -36,4 +36,4 @@ Unlike deferral, `dbt clone` requires some compute and creation of additional ob
 
 For example, by creating actual data warehouse objects, `dbt clone` allows you to test out your code changes on downstream dependencies _outside of dbt_ (such as a BI tool). 
 
-As another example, you could `clone` your modified incremental models as the first step of your dbt Cloud CI job to prevent costly `full-refresh` builds for warehouses that support zero-copy cloning.
+As another example, you could `clone` your modified incremental models as the first step of your dbt Cloud CI job to prevent costly `full-refresh` builds for warehouses that support zero-copy cloning.
diff --git a/website/docs/reference/snowflake-permissions.md b/website/docs/reference/snowflake-permissions.md
@@ -15,9 +15,11 @@ grant usage on schema database.an_existing_schema to role role_name;
 grant create table on schema database.an_existing_schema to role role_name;
 grant create view on schema database.an_existing_schema to role role_name;
 grant usage on future schemas in database database_name to role role_name;
+grant monitor on future schemas in database database_name to role role_name;
 grant select on future tables in database database_name to role role_name;
 grant select on future views in database database_name to role role_name;
 grant usage on all schemas in database database_name to role role_name;
+grant monitor on all schemas in database database_name to role role_name;
 grant select on all tables in database database_name to role role_name;
 grant select on all views in database database_name to role role_name;
 ```
diff --git a/website/sidebars.js b/website/sidebars.js
@@ -275,6 +275,7 @@ const sidebarSettings = {
               label: "Metrics",
               link: { type: "doc", id: "docs/build/metrics-overview"},
               items: [
+                "docs/build/cumulative",
                 "docs/build/derived",
                 "docs/build/ratio",
                 "docs/build/simple",

diff --git a/website/static/img/blog/2023-07-17-GPT-and-dbt-test/image1.png b/website/static/img/blog/2023-07-17-GPT-and-dbt-test/image1.png
diff --git a/website/static/img/blog/2023-07-17-GPT-and-dbt-test/image2.png b/website/static/img/blog/2023-07-17-GPT-and-dbt-test/image2.png
diff --git a/website/static/img/blog/2023-07-17-GPT-and-dbt-test/image3.png b/website/static/img/blog/2023-07-17-GPT-and-dbt-test/image3.png
diff --git a/website/static/img/blog/authors/pedro_brito.jpeg b/website/static/img/blog/authors/pedro_brito.jpeg