bugfix/hubspot-duplicates #5

fivetran-joemarkiewicz · 2024-10-21T23:21:23Z

This PR addresses duplicate HubSpot records uncovered while testing the model on a new schema. It also adds unique and not_null tests to the final unified RAG output model.
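
For context, here is a rough sketch of what unique and not-null checks look for, run against an in-memory SQLite table. The table contents and the column names `unique_id` and `document` are hypothetical; the description above does not say which column the new tests cover.

```python
import sqlite3

# Hypothetical stand-in for the final model; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table rag__unified_document (unique_id text, document text);
    insert into rag__unified_document values
        ('d1', 'first'),
        ('d2', 'second'),
        ('d2', 'duplicate of second'),
        (null, 'row without an id');
""")

# Uniqueness check: any key that appears more than once is a failure.
duplicate_ids = conn.execute("""
    select unique_id, count(*) as n
    from rag__unified_document
    where unique_id is not null
    group by unique_id
    having count(*) > 1
""").fetchall()

# Not-null check: any row with a missing key is a failure.
null_rows = conn.execute("""
    select count(*)
    from rag__unified_document
    where unique_id is null
""").fetchone()[0]

print(duplicate_ids, null_rows)
```

A passing model would return no duplicate keys and zero null rows; the sample data here is deliberately broken so both checks report something.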

fivetran-joemarkiewicz · 2024-10-21T23:23:54Z

models/intermediate/hubspot/int_rag_hubspot__deal_comment_document.sql

@@ -45,14 +45,15 @@ engagement_emails as (
        engagement_email.email_to_email,
        engagement_email.email_cc_email,
        engagement_email.email_from_email as commenter_email,
-        contacts.contact_name as commenter_name
+        {{ fivetran_utils.string_agg(field_to_agg="contacts.contact_name", delimiter="','") }} as commenter_name


I found a few cases where multiple contacts are associated with a single engagement email, which resulted in fan-out. The string_agg ensures all parties are included in the resulting data without causing any fan-out.
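
A minimal illustration of the fan-out and the aggregation fix, using in-memory SQLite rather than the warehouse. The tables and values are made up, `group_concat` stands in for the `fivetran_utils.string_agg` macro, and a plain `,` is used as the delimiter instead of the one in the diff.

```python
import sqlite3

# Hypothetical data: one engagement email linked to two contacts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table engagement_email (engagement_id integer, commenter_email text);
    create table contacts (engagement_id integer, contact_name text);
    insert into engagement_email values (1, 'team@example.com');
    insert into contacts values (1, 'Ada'), (1, 'Grace');
""")

# Plain join: the single email row is repeated once per matching contact.
fanned_out = conn.execute("""
    select e.engagement_id, c.contact_name
    from engagement_email e
    left join contacts c on c.engagement_id = e.engagement_id
""").fetchall()

# Aggregated join: one row per email, with every contact name kept in one string.
aggregated = conn.execute("""
    select e.engagement_id,
           e.commenter_email,
           group_concat(c.contact_name, ',') as commenter_name
    from engagement_email e
    left join contacts c on c.engagement_id = e.engagement_id
    group by e.engagement_id, e.commenter_email
""").fetchall()

print(len(fanned_out), len(aggregated))
```

The order of names inside the aggregated string is not guaranteed, so anything downstream that compares these values should not rely on it.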

Good catch.

fivetran-joemarkiewicz · 2024-10-21T23:24:32Z

models/intermediate/hubspot/int_rag_hubspot__deal_document.sql

+engagement_details as (
+    select
+        deal_id,
+        deal_name,
+        url_reference,
+        created_on,
+        source_relation,
+        {{ fivetran_utils.string_agg(field_to_agg="distinct engagement_type", delimiter="', '") }} as engagement_type,
+        {{ fivetran_utils.string_agg(field_to_agg="distinct contact_name", delimiter="', '") }} as contact_name,
+        {{ fivetran_utils.string_agg(field_to_agg="distinct created_by", delimiter="', '") }} as created_by,
+        {{ fivetran_utils.string_agg(field_to_agg="distinct company_name", delimiter="', '") }} as company_name
+    from engagement_detail_prep
+    group by 1,2,3,4,5


Similar to the previous model: the joins in the previous CTE are not 1:1, so we need some creative aggregating to retain all the necessary information without causing any fan-out.
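
To show why the distinct matters when more than one join is not 1:1, here is a small sketch in in-memory SQLite. The tables `deals`, `deal_contacts`, and `deal_companies` are hypothetical, and `group_concat(distinct ...)` with SQLite's default `,` separator stands in for the macro call in the diff.

```python
import sqlite3

# Hypothetical data: one deal with two contacts and two companies.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table deals (deal_id integer, deal_name text);
    create table deal_contacts (deal_id integer, contact_name text);
    create table deal_companies (deal_id integer, company_name text);
    insert into deals values (10, 'Renewal');
    insert into deal_contacts values (10, 'Ada'), (10, 'Grace');
    insert into deal_companies values (10, 'Acme'), (10, 'Globex');
""")

# Two one-to-many joins multiply: 2 contacts x 2 companies = 4 rows for one deal.
prep_sql = """
    select d.deal_id, d.deal_name, c.contact_name, co.company_name
    from deals d
    left join deal_contacts c on c.deal_id = d.deal_id
    left join deal_companies co on co.deal_id = d.deal_id
"""
prep_rows = conn.execute(prep_sql).fetchall()

# Grouping back to the deal grain; distinct removes the repeats the joins introduced.
details = conn.execute(f"""
    select deal_id,
           deal_name,
           group_concat(distinct contact_name) as contact_name,
           group_concat(distinct company_name) as company_name
    from ({prep_sql}) as engagement_detail_prep
    group by deal_id, deal_name
""").fetchall()

print(len(prep_rows), details)
```

Without the distinct, each contact and company name would appear twice in its aggregated string, because every name occurs once per row of the multiplied join.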

fivetran-joemarkiewicz · 2024-10-21T23:25:15Z

models/rag__unified_document.sql

@@ -1,8 +1,8 @@
 {{
    config(
        materialized='table' if unified_rag.is_databricks_sql_warehouse() else 'incremental',
-        partition_by = {'field': 'most_recent_chunk_update', 'data_type': 'date', 'granularity': 'month'}
-            if target.type not in ['spark', 'databricks'] else ['most_recent_chunk_update'],
+        partition_by = {'field': 'update_date', 'data_type': 'date'}


Ran into some issues with the incremental logic on BigQuery. These changes addressed those issues, although they required adding a new field (which has been documented, and the docs regenerated).
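
The comment does not say what the incremental issue was, so the sketch below only illustrates the change in partition grain visible in the diff: month granularity on most_recent_chunk_update versus a date-typed update_date with no granularity set (dbt-bigquery treats an omitted granularity as day). It assumes update_date is the date part of most_recent_chunk_update, which the excerpt does not confirm; the rows and the helper `partition_rows` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows; only the timestamp matters for partition assignment.
rows = [
    {"unique_id": "a", "most_recent_chunk_update": datetime(2024, 10, 1, 9, 0)},
    {"unique_id": "b", "most_recent_chunk_update": datetime(2024, 10, 21, 23, 21)},
    {"unique_id": "c", "most_recent_chunk_update": datetime(2024, 10, 21, 23, 25)},
    {"unique_id": "d", "most_recent_chunk_update": datetime(2024, 11, 2, 8, 30)},
]


def partition_rows(rows, key):
    """Group row ids by the partition value that `key` computes for each row."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key(row)].append(row["unique_id"])
    return dict(partitions)


# Old config: one partition per calendar month of the timestamp.
by_month = partition_rows(
    rows, lambda r: r["most_recent_chunk_update"].strftime("%Y-%m")
)

# New config (assumed): one partition per calendar day via update_date.
by_day = partition_rows(
    rows, lambda r: r["most_recent_chunk_update"].date().isoformat()
)

print(by_month)
print(by_day)
```

With daily partitions, an incremental run that touches one day's rows affects a much smaller slice of the table than a run against a monthly partition would.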

fivetran-avinash

@fivetran-joemarkiewicz good catches, lgtm

bugfix/hubspot-duplicates

d984fe1

fivetran-joemarkiewicz self-assigned this Oct 21, 2024

fivetran-joemarkiewicz requested a review from fivetran-avinash October 21, 2024 23:21

fivetran-joemarkiewicz commented Oct 21, 2024

fivetran-avinash approved these changes Oct 22, 2024

fivetran-joemarkiewicz merged commit 956b0d6 into main Oct 22, 2024
7 checks passed
