Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get RxNorm data models up to speed with dbt best practices #280

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

lprzychodzien
Copy link
Collaborator

Resolves #270

Explanation

Changes rxnorm staging and intermediate queries to using dbt's jinja table references.

Most important is that it sets intermediate models to materialize as tables (vs views).

Rationale

dbt mart models are made up of complex logic that should be captured in the intermediate models. The issues already that we ran into (issue #270 ) was that performing these aggregations for each query took 10+ minutes. Therefore, bringing this aggregation into a separated intermediate model to be materialized as a table is a good solution. The materialization of these intermediate tables can take awhile but will significantly speed up queries.

Additional work can be done to optimize the queries that build out intermediate models in the future.

Tests

  1. What testing did you do? dbt run --full-refresh
  2. Attach testing logs inside a summary block:
testing logs

@@ -35,7 +35,7 @@ models:
+materialized: view
intermediate:
+schema: sagerx
+materialized: view
+materialized: table
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most impactful change here

Copy link
Contributor

@leemlb06pmi leemlb06pmi Apr 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason we maybe wouldn't want to make this change by itself would be that the materialized table has no schedule for being refreshed, other than when we run the DAGs that rebuild everything. I'm leaving a comment on the original issue discussion as to another possible solution

Also, this particular change would materialize ALL the models in this layer as tables which might become slightly taxing on the DAGs that build this layer (definitely Rxnorm, potentially others?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry im not following your concerns.

  1. The intermediate tables are updated by the rxnorm DAG with the final "transform" task. So the tables should be updated after running the DAG which I believe is the correct workflow. Happy to discuss.
  2. I dont think we should see it as being taxing on the DAG to run this, yes the DAG would be doing more work, but we are moving the computation upstream so that the queries run faster/cheaper.
  3. The only issue I have with this is that all of the intermediate tables will be cluttering the database, but again that might be warranted as we work through the data and business logic.

@jrlegrand
Copy link
Member

Man you did tons of work here. Great job. Fundamental question themes:

  1. Should we title ctes specifically to what they represent even if we are just pulling in rxnconso multiple times? Or should we have one rxnconso cte per model and use aliases and leave everything else the same? Or should we name them what they represent and pull the where clause into the cte? Or should we make like other "base" staging tables with them? Or should we just look for opportunities to reference these staging models from other int/stg models instead of going back to source all the time?
  2. I'm def referencing source tables from a lot of intermediate models - shame on me. I need to fix this.
  3. Minor formatting changes still needed to satisfy my OCD (lowercase SQL commands and tab indent in some places).
  4. Some ref's still in select from instead of within a cte first.
  5. Maybe we should build a dbt macro for the active / prescribable columns - prob in a separate issue.

Another major thing I will look at is taking a 50,000 foot view of the entire RxNorm data model - it's hard to see when zoomed in on a given intermediate model, but I think I re-wrote a lot of the code in other intermediate models inside of intermediate models. So I think there's an opportunity to ref intermediate models instead of re-writing that SQL.

@lprzychodzien lprzychodzien force-pushed the rxnorm branch 2 times, most recently from ca79a75 to 06065c5 Compare May 17, 2024 16:12
@jrlegrand jrlegrand self-requested a review July 16, 2024 19:28
@jrlegrand jrlegrand self-assigned this Jul 16, 2024
@jrlegrand
Copy link
Member

I'd like to start moving toward styling our SQL like the dbt styleguide.

https://docs.getdbt.com/best-practices/how-we-style/2-how-we-style-our-sql

@jrlegrand jrlegrand changed the title Rxnorm dbt update and intermediate tables Get RxNorm data models up to speed with dbt Jul 20, 2024
@jrlegrand jrlegrand changed the title Get RxNorm data models up to speed with dbt Get RxNorm data models up to speed with dbt best practices Jul 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Optimize RxNorm
4 participants